buzhash64 chunker #8903
Some notes: A) Before working on buzhash64, I had a look at fastCDC and gear hashing. But because gear just does a simple bit shift (not a bit rotate as buzhash does), it can only work with window sizes of up to 64 bytes. I had the gut feeling that this limits the configurability of the context size for a "chunking point" too much compared to borg's buzhash, which can be configured over a wider range (default: 4095 bytes). B) The chunking "quality" of the new chunker is researched now, see below.
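For illustration, a minimal sketch in plain C of the difference between the two per-byte updates (the splitmix64 table filler is only there so this compiles standalone; borg derives its table differently): gear's shift silently expires a byte's influence after 64 steps, while buzhash's rotate keeps every window byte in play until it is explicitly removed, which is what makes the window size a free parameter.

```c
#include <stdint.h>
#include <stddef.h>

/* Rotate left, well-defined for n == 0 too. */
#define ROTL64(x, n) (((x) << ((n) & 63)) | ((x) >> ((64 - (n)) & 63)))

static uint64_t gear_table[256], buz_table[256];

/* Deterministic table filler (splitmix64), just so the sketch is runnable. */
static void fill_tables(uint64_t seed)
{
    for (int i = 0; i < 256; i++) {
        uint64_t z = (seed += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        gear_table[i] = buz_table[i] = z ^ (z >> 31);
    }
}

/* Gear: shift + add. A byte's bits are shifted out after 64 updates,
   so the effective window cannot exceed 64 bytes. */
static inline uint64_t gear_update(uint64_t h, unsigned char add)
{
    return (h << 1) + gear_table[add];
}

/* Buzhash: rotate + xor, with explicit removal of the byte leaving the
   window; the window size (borg default: 4095) is freely configurable. */
static inline uint64_t buz_update(uint64_t h, unsigned char remove,
                                  unsigned char add, size_t window_size)
{
    return ROTL64(h, 1) ^ ROTL64(buz_table[remove], window_size & 0x3f) ^ buz_table[add];
}
```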
Codecov Report

@@ Coverage Diff @@
## master #8903 +/- ##
==========================================
+ Coverage   81.83%   82.03%   +0.19%
==========================================
  Files          77       77
  Lines       13466    13481    +15
  Branches     1990     1995     +5
==========================================
+ Hits        11020    11059    +39
+ Misses       1773     1757    -16
+ Partials      673      665     -8
Analysis script's output:
Force-pushed from 6f29108 to 3a3aae7.
Added some *64*.* files that are just 1:1 copies of their 32bit counterparts, so that the changes for the 64bit adaptation will later be better visible.
The buzhash seed only has 32 bits, but we want 64 bits for buzhash64; just take them from crypt_key for now.
That way we can feed lots of entropy into the table creation. The bh64_key is derived from the id_key (NOT the crypt_key), thus it will create the same key for related repositories (even if they use different encryption/authentication keys). Due to that, it will also create the same buzhash64 table, will cut chunks at the same points and deduplication will work amongst the related repositories.
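As a sketch of the idea (the derivation and names here are hypothetical; borg's actual key derivation may differ): deriving the table key from the id_key with a fixed domain-separation label makes the result deterministic per ID key, so related repositories get the same bh64_key regardless of their encryption keys.

```c
#include <stdint.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

/* Hypothetical derivation sketch: bh64_key = HMAC-SHA256(id_key, label).
   Same id_key (related repos) -> same bh64_key -> same table -> same cut
   points, even when crypt_key differs. */
static void derive_bh64_key(const uint8_t id_key[32], uint8_t bh64_key[32])
{
    static const unsigned char label[] = "borg-buzhash64-table";
    unsigned int outlen = 32;
    HMAC(EVP_sha256(), id_key, 32, label, sizeof(label) - 1, bh64_key, &outlen);
}
```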
Force-pushed from 3a3aae7 to 51f4fe9.
Hmm, currently it is using AES-CTR mode, but I guess that's wrong because it likely destroys dedup between similar files (e.g. if one has some bytes inserted at the beginning: since the offsets of later data are shifted, it would get different encryption tokens from the PRF stream and thus, for the same buzhash value, would decide differently whether to chunk or not). Maybe even that whole PRF stream approach is wrong and I need to go back to just using AES-ECB on each hash value (which is 5x slower due to EVP API call overhead). Update: I removed the AES buzhash encryption from this PR and will continue working on it in a separate PR.
Force-pushed from 51f4fe9 to d23704e.
# (1) the hash is designed for inputs <= 64 bytes, but the chunker uses it on a 4095 byte window;
#     any repeating bytes at distance 64 within those 4095 bytes can cause cancellation within
#     the hash function, e.g. in "X <any 63 bytes> X", the last X would cancel out the influence
#     of the first X on the hash value.
I'd like to add that the following comment also applies to buzhash64:
# (2) the hash table is supposed to have (according to the BUZ) exactly a 50% distribution of
# 0/1 bit values per position, but the generated table below doesn't fit that property.
Truncated SHA-256 does not guarantee bit-balance.
While it's possible to achieve bit-balance with additional logic — such as seeding a CSPRNG with a key and selecting 128 of 256 bits to be 1 (e.g. using a Fisher–Yates shuffle directly or vendoring some version of random.sample()) — I'm not convinced that such a measure is necessary to produce a good hash. More importantly, it is not sufficient on its own to ensure good hash properties.
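To make that construction concrete, here is a minimal sketch of one way to realize it (the xorshift stub merely stands in for a properly keyed CSPRNG, and all names are illustrative): for each of the 64 bit positions, Fisher–Yates-shuffle the 256 table-row indices and set that bit in exactly 128 entries, yielding a strictly 50/50 column balance.

```c
#include <stdint.h>
#include <string.h>

/* Stub PRNG so the sketch is self-contained; a real implementation would
   use a CSPRNG keyed from the repository key (e.g. ChaCha20). */
static uint64_t prng_state;
static uint64_t prng_next(void)
{
    uint64_t x = prng_state;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return prng_state = x;
}
/* Uniform-ish value in [0, n); modulo bias ignored for the sketch. */
static uint32_t prng_below(uint32_t n) { return (uint32_t)(prng_next() % n); }

/* Build a 256-entry table where every bit position is set in exactly 128
   entries: per column, shuffle the row indices and set the bit in the
   first 128 of them. */
static void make_balanced_table(uint64_t table[256], uint64_t key)
{
    prng_state = key ? key : 1;  /* xorshift must not be seeded with 0 */
    memset(table, 0, 256 * sizeof(uint64_t));
    for (int bit = 0; bit < 64; bit++) {
        int rows[256];
        for (int i = 0; i < 256; i++)
            rows[i] = i;
        for (int i = 255; i > 0; i--) {          /* Fisher-Yates shuffle */
            int j = (int)prng_below((uint32_t)i + 1);
            int tmp = rows[i]; rows[i] = rows[j]; rows[j] = tmp;
        }
        for (int i = 0; i < 128; i++)
            table[rows[i]] |= 1ULL << bit;
    }
}
```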
For instance, there is an interesting corner case.
Robert Uzgalis originally defined Buzhash() as a 31-bit function ([BUZ]) for reasons unclear to me. I was unable to find the original publications ([GEN], [MYTHS]) that were referenced by Uzgalis himself ([BH]), but it appears this design choice may be related to the signedness of the result rather than any inherent property of the function. John Hamer ([JH]) also implemented it as a 31-bit function, though again, this might be due to Java's use of signed integers.
References:
- [BUZ] http://www.serve.net/buz/Notes.1st.year/HTML/C6/rand.012.html
- [GEN] Uzgalis RC. General hash functions. Technical Report, Dept. of Computer Science, University of Hong Kong, 1992.
- [MYTHS] Uzgalis and Tong. Hashing Myths. Technical Report 97, Dept. of Computer Science, University of Auckland, July 1994.
- [BH] http://www.serve.net/buz/buz.basic/research.html#hashing
- [JH] John Hamer. Hashing revisited (2002). DOI:10.1145/637610.544440
Interestingly, it is possible to construct substitution tables that satisfy the bit-balance requirement while effectively reducing Buzhash32 down to a 31-bit function. Specifically, if popcount(table[c]) is even for every byte c, then popcount(Buzhash(x)) will also be even for any input. This follows trivially from the fact that both rot() and xor() preserve the parity of the popcount.
Similarly, if popcount(table[c]) is odd for every byte c, then the parity of popcount(Buzhash(x)) will depend on the window size.
However, generating an "Unlucky" table of that kind using SHA(i + key) is really improbable.
So, while strict 50% bit distribution might help reduce bias, it is not sufficient to make Buzhash equidistributed. Also, I’m not aware of a simple way to measure that bias or to prove its presence with confidence.
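The parity argument above is easy to check empirically. A small sketch (64-bit variant; __builtin_popcountll() is a GCC/Clang builtin, and rand() is used only to keep it short): force every table entry to even popcount and observe that the hash's popcount is always even, i.e. one output bit of entropy is lost.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define ROTL64(x, n) (((x) << ((n) & 63)) | ((x) >> ((64 - (n)) & 63)))

int main(void)
{
    uint64_t table[256];
    srand(12345);
    /* Build an "unlucky" table: every entry has even popcount. */
    for (int i = 0; i < 256; i++) {
        uint64_t v = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        if (__builtin_popcountll(v) & 1)
            v ^= 1;                         /* flip one bit -> even popcount */
        table[i] = v;
    }
    /* Hash random windows; rot preserves popcount and xor adds popcount
       parities, so the result's popcount must stay even. */
    const int W = 4095;                     /* window size */
    for (int trial = 0; trial < 1000; trial++) {
        uint64_t h = 0;
        for (int k = 0; k < W; k++)         /* full (non-rolling) buzhash */
            h = ROTL64(h, 1) ^ table[rand() & 0xff];
        if (__builtin_popcountll(h) & 1) {
            puts("odd popcount -- argument violated");
            return 1;
        }
    }
    puts("popcount(hash) was even for all trials, as predicted");
    return 0;
}
```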
Good catch, I missed that pseudo-randomness of course does not guarantee precise 50/50 bit distribution per bit position.
About the 31bit stuff: OFC, I don't know what they were thinking, but I think it was probably to avoid signedness issues.
About reducing it accidentally to a 31bit function: guess the probability for the "even case" is 1 : 2^256 and thus we can safely ignore it? Same for the "odd case".
I removed the sha256 from the code in my new PR. The key fed to the table init function is derived from a 256bit_random_secret.
guess the probability for the "even case" is 1 : 2^256
That's right. However, my concern lies in a different area. Buzhash is used in several projects, yet I've not found a single publication mentioning such "unlucky" tables as a possible problem of the hash function. That makes me think this specific hash function has not been well studied and might have more pitfalls of higher probability.
I mean, I'm not worried about the 1:2^256 case. I'm worried about something unknown, similar to going from "window % 64 == 0 is problematic" in the xor-seed case down to "every even window size is problematic" (#8868).
Yes, I think it's the best (known) way to go for Buzhash tables.
Thanks for your work on Borg! It's truly awesome.
cdef uint64_t _buzhash64_update(uint64_t sum, unsigned char remove, unsigned char add, size_t len, const uint64_t* h):
    """Update the buzhash with a new byte."""
    cdef uint64_t lenmod = len & 0x3f
    return BARREL_SHIFT64(sum, 1) ^ BARREL_SHIFT64(h[remove], lenmod) ^ h[add]
I'd also like to mention a few Buzhash-related optimizations that might help improve chunker performance.
1. Inlining _buzhash_update()
Buzhash is quite fast, even as a one-byte-at-a-time function: it measured around 2.5 cycles/byte in my benchmarks. As a result, the overhead of calling _buzhash_update() can be comparable to the cost of the actual computation. For this reason, it's important that the compiler can inline _buzhash_update() inside the tight while ... loop.
This is typically achieved by declaring the function with static linkage. If I understand generate_cfunction_declaration() from Cython correctly, that should already be the case. However, I noticed you've defined BARREL_SHIFT64() as a macro rather than a static function, which suggests you might already be aware of possible inlining pitfalls; that's why I wanted to raise this just in case.
2. Using a Precomputed Rotated Table
You might also consider trading 2 KiB of memory for a potential speedup.
Since ChunkerBuzHash64() takes window_size as a parameter, you could precompute BARREL_SHIFT64(x, lenmod) for each value, doubling the size of the lookup table to 2 * 256 * sizeof(uint64_t) bytes. If memory serves me well, this optimization gave me an ≈25% speedup.
One more detail: I found that performance varied depending on how the rotated values were stored. Surprisingly, concatenating the original and rotated tables performed better in my tests than interleaving them, though that result was a bit counterintuitive for me.
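A sketch of that precomputed-table variant with the concatenated layout (names hypothetical): since window_size is fixed per chunker instance, the rotation of the removed byte's table entry can be baked into a second 256-entry half once, up front.

```c
#include <stdint.h>
#include <stddef.h>

#define ROTL64(x, n) (((x) << ((n) & 63)) | ((x) >> ((64 - (n)) & 63)))

/* Concatenated layout: original table first (for 'add'), then the
   pre-rotated table (for 'remove'), 2 * 256 * sizeof(uint64_t) total. */
static void build_rotated_table(const uint64_t table[256], size_t window_size,
                                uint64_t table2[512])
{
    for (int i = 0; i < 256; i++) {
        table2[i]       = table[i];
        table2[256 + i] = ROTL64(table[i], window_size & 0x3f);
    }
}

/* The rolling update then needs no per-byte rotation of the table value. */
static inline uint64_t buzhash64_update(uint64_t sum, unsigned char remove,
                                        unsigned char add, const uint64_t table2[512])
{
    return ROTL64(sum, 1) ^ table2[256 + remove] ^ table2[add];
}
```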
3. Using a Sentinel to Eliminate Bounds Checks
There’s a more aggressive optimization that avoids bounds checking on each while ... loop iteration.
By appending a sentinel value at the end of the buffer, you can guarantee that a cut point will always be found — either in the actual data, within the sentinel or at the very end of it. Specifically, the sentinel should consist of window_size bytes that produce a Buzhash() result with hash & chunk_mask == 0.
This optimization resulted in an improvement of approximately 0.25 cycles/byte, translating to a ≈10% speedup. That said, YMMV — its effectiveness likely depends on CPU pipelining and the relative sizes of buf_size (1 << chunk_max_exp), window_size, and the CPU's Last-Level Cache (LLC). For example, a larger buf_size means more bounds checks are avoided, while a larger window_size increases the overhead of computing the Buzhash of sentinel. If buf_size exceeds the LLC, you may also encounter RAM-related stalls.
I haven't found an easy way to compute a sentinel directly from the Buzhash table. The options I used were either to memoize the first valid cut point when it occurs, or to generate a random stream of bytes and rely on probability to do its job. The resulting sentinel can then be stored either in the runtime ChunkerBuzHash64() object or as part of the repository metadata.
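A rough sketch of how the sentinel changes the scan loop (simplified, hypothetical names, table2 layout as in the previous sketch; the real chunker also enforces a minimum chunk size etc.): the caller appends window_size sentinel bytes at data[len .. len+window_size-1] whose Buzhash satisfies (hash & chunk_mask) == 0, so the loop needs no bounds check.

```c
#include <stdint.h>
#include <stddef.h>

#define ROTL64(x, n) (((x) << ((n) & 63)) | ((x) >> ((64 - (n)) & 63)))

/* Buzhash of one full window. */
static uint64_t buzhash64(const uint8_t *p, size_t window_size,
                          const uint64_t table2[512])
{
    uint64_t h = 0;
    for (size_t k = 0; k < window_size; k++)
        h = ROTL64(h, 1) ^ table2[p[k]];
    return h;
}

static inline uint64_t buzhash64_update(uint64_t sum, uint8_t remove, uint8_t add,
                                        const uint64_t table2[512])
{
    return ROTL64(sum, 1) ^ table2[256 + remove] ^ table2[add];
}

/* Returns the cut position (end of chunk). No `i < len` test inside the
   loop: a match is guaranteed at the latest once the window lies entirely
   inside the sentinel. */
static size_t find_cut(const uint8_t *data, size_t len, size_t window_size,
                       uint64_t chunk_mask, const uint64_t table2[512])
{
    uint64_t h = buzhash64(data, window_size, table2);
    size_t i = window_size;
    while ((h & chunk_mask) != 0) {
        h = buzhash64_update(h, data[i - window_size], data[i], table2);
        i++;
    }
    return i < len ? i : len;  /* clamp cuts that landed in the sentinel */
}
```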
For full transparency: my experiments were done in plain C11, and I don’t have much experience with Cython. Please feel free to disregard any of these suggestions if they don't apply well to your codebase.
Thanks for your insightful comments, very much appreciated!
I tried 2 things:
- using cdef inline ... for the _buzhash_update function
- manually inlining it into the loop
I didn't see a notable difference on my machine (running the borg benchmark cpu chunker tests for ~10s). Maybe the C compiler is clever enough to already do that.
Or the M3P cpu is very efficient when calling small functions.
I also tried pre-computing the first BARREL_SHIFT macro parameter, to avoid it getting "evaluated" twice due to macro expansion. It also didn't make a difference.
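For reference, the pitfall being worked around looks like this (a sketch; borg's actual macro may differ slightly). For a plain variable the compiler CSEs the two loads anyway, which matches the "no difference" observation; double evaluation only bites with side effects or expensive argument expressions.

```c
#include <stdint.h>

/* A function-like macro expands its arguments textually, so `v` appears
   (and is evaluated) twice in the expansion: */
#define BARREL_SHIFT64(v, shift) \
    (((v) << (shift)) | ((v) >> ((64 - (shift)) & 0x3f)))

/* A static inline function evaluates its argument exactly once: */
static inline uint64_t barrel_shift64(uint64_t v, unsigned shift)
{
    return (v << shift) | (v >> ((64 - shift) & 0x3f));
}
```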
This is typically achieved by declaring the function with static linkage. If I understand generate_cfunction_declaration() from Cython correctly, that should already be the case
Yep, I've verified that. That's the case at least with Cython version 0.29.28. The resulting __pyx_f_4borg_8chunkers_9buzhash64__buzhash64_update() function is static. Turns out, getting C code is as simple as cython3 buzhash64.pyx :-)
Biggest win (about 15-20% faster) was 8-fold loop unrolling and 8-fold table value prefetching to improve memory access patterns.
But I don't really like it; it bloats the source code quite a bit.
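Roughly what that looks like (a sketch with hypothetical names; __builtin_prefetch() is a GCC/Clang extension, and the caller must guarantee 16 valid input bytes ahead):

```c
#include <stdint.h>

#define ROTL64(x, n) (((x) << ((n) & 63)) | ((x) >> ((64 - (n)) & 63)))

/* Process 8 bytes per call: prefetch the table entries for the *next*
   block of 8 input bytes, then do 8 unrolled rolling updates. The real
   loop also has to test the cut condition after each update and handle
   the remainder bytes. */
static inline uint64_t roll8(uint64_t h, const uint8_t *rem, const uint8_t *add,
                             unsigned lenmod, const uint64_t table[256])
{
    for (int k = 0; k < 8; k++)
        __builtin_prefetch(&table[add[k + 8]]);  /* needs add[8..15] readable */
    for (int k = 0; k < 8; k++)
        h = ROTL64(h, 1) ^ ROTL64(table[rem[k]], lenmod) ^ table[add[k]];
    return h;
}
```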
Thanks for sharing! That's an interesting result to know about. I've also tried unrolling & aligned prefetching, but gained nothing measurable.
I didn't even enforce alignment, I just prefetched 8 data bytes / 8 table entries directly after each other.
Maybe ARM CPUs behave a bit differently here than Intel/AMD.
Original PR #8903 by ThomasWaldmann: borgbackup/borg#8903
buzhash table creation
Instead of changing the table_base with the pre-computed 32bit "random" values into one with pre-computed 64bit "random" values, I decided to rather drop it from the source code and deterministically generate the table using sha256(i, key). Instead of the previous seed, it now uses a key that is derived from the ID key.

That should also resolve some points of criticism about the old buzhash 32bit code's table_base: that the bits are not randomly distributed enough.
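A sketch of that table initialization (using OpenSSL's one-shot SHA256(); the exact input encoding of i and key here is an assumption, the final code may differ): entry i is the first 8 bytes of sha256(i || key).

```c
#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

/* Deterministic table generation: table[i] = first 8 bytes (big-endian)
   of SHA-256(i || key), with key derived from the ID key. */
static void buzhash64_init_table(const uint8_t key[32], uint64_t table[256])
{
    for (int i = 0; i < 256; i++) {
        uint8_t input[1 + 32], digest[SHA256_DIGEST_LENGTH];
        input[0] = (uint8_t)i;
        memcpy(input + 1, key, 32);
        SHA256(input, sizeof(input), digest);
        uint64_t v = 0;
        for (int k = 0; k < 8; k++)
            v = (v << 8) | digest[k];
        table[i] = v;
    }
}
```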
Sizes
key / seed: now 256 bits, was 32 bits.
buzhash: now 64 bits, was 32 bits.
Performance
(measurements from Apple MBP M3P)
Security?
I guess these changes make fingerprinting attacks harder - some independent review of this would be very welcome.
See also: https://github.com/borgbackup/borg/wiki/CDC-issues-reported-2025
First paper, section 3.5 (compression enabled case) suggests using a 64bit hash (if I understood it correctly).
Second paper suggests doing something like AES(buzhash_value) before doing the chunking decision (== for every byte position in a file). That was shifted to a future PR.