
Conversation

@ThomasWaldmann (Member) commented Jun 5, 2025

buzhash table creation

Instead of replacing the table_base of pre-computed 32bit "random" values with one of pre-computed 64bit "random" values, I decided to rather drop it from the source code and deterministically generate the table using sha256(i, key). Instead of the previous seed, it now uses a key that is derived from the ID key.
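For illustration, a minimal Python sketch of one way such a table derivation could look (the exact digest packing, byte order, and key handling in the actual code may differ):

```python
import hashlib
import struct

def buzhash64_table(key: bytes) -> list[int]:
    # One 64-bit entry per byte value: take the first 8 bytes of
    # sha256(i || key). Purely illustrative; not the exact borg code.
    table = []
    for i in range(256):
        digest = hashlib.sha256(bytes([i]) + key).digest()
        table.append(struct.unpack("<Q", digest[:8])[0])
    return table
```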

That should also resolve some points of criticism about the old buzhash 32bit code:

  • table_base: the bits are not randomly distributed enough.
  • an XORed seed cancels out for specific window sizes.
  • XORing the table with a seed is equivalent to XORing the computed hash value with another constant (see the sketch after this list).
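A small, self-contained Python sketch of the algebra behind the last two points (illustrative only, not borg code): because rotation distributes over XOR, a seed XORed into every table entry only XORs the final hash with a constant that depends on the window size, not on the data, and for some window sizes that constant is zero, so the seed cancels entirely.

```python
import os

def rotl(x: int, n: int, bits: int) -> int:
    n %= bits
    return ((x << n) | (x >> (bits - n))) & ((1 << bits) - 1)

def buzhash(data: bytes, table: list[int], bits: int) -> int:
    h = 0
    for b in data:
        h = rotl(h, 1, bits) ^ table[b]
    return h

base = [int.from_bytes(os.urandom(4), "little") for _ in range(256)]
seed = 0x9E3779B9
seeded = [t ^ seed for t in base]            # table_base XORed with a seed

w1, w2 = os.urandom(4095), os.urandom(4095)
d1 = buzhash(w1, base, 32) ^ buzhash(w1, seeded, 32)
d2 = buzhash(w2, base, 32) ^ buzhash(w2, seeded, 32)
assert d1 == d2                              # the seed only adds a data-independent constant

w3 = os.urandom(64)                          # window size a multiple of 64
assert buzhash(w3, base, 32) == buzhash(w3, seeded, 32)   # here the seed cancels completely
```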

Sizes

key / seed: now 256bits, was 32bits.
buzhash: now 64bits, was 32bits.

Performance

buzhash,19,23,21,4095          1GB        0.884s  # 32bit
buzhash64,19,23,21,4095        1GB        0.909s  # 64bit

(measurements from Apple MBP M3P)

Security?

I guess these changes make fingerprinting attacks harder - some independent review of this would be very welcome.

See also: https://github.com/borgbackup/borg/wiki/CDC-issues-reported-2025

The first paper, section 3.5 (the compression-enabled case), suggests using a 64bit hash (if I understood it correctly).

The second paper suggests doing something like AES(buzhash_value) before making the chunking decision (i.e. for every byte position in a file). That was deferred to a future PR.

@ThomasWaldmann (Member, Author) commented Jun 5, 2025

Some notes:

A) Before working on buzhash64, I had a look at fastCDC and gear hashing.

But because gear just does a simple bit shift (not a bit rotate, as buzhash does), it can only work with window sizes of up to 64 bytes. My gut feeling was that this limits the configurability of the context size for a "chunking point" compared to borg's buzhash, which can be configured over a much wider range (default: 4095 bytes).
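To illustrate the difference, a hedged sketch (not code from either project): gear shifts old bits out of the word, so a byte's direct influence is gone after 64 update steps, while buzhash rotates and explicitly removes the byte that leaves the window, so the window size is free to be much larger.

```python
MASK64 = (1 << 64) - 1

def gear_update(h: int, add: int, table: list[int]) -> int:
    # FastCDC-style gear: old bytes' bits are shifted out of the top,
    # so only roughly the last 64 bytes influence the hash.
    return ((h << 1) + table[add]) & MASK64

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64

def buzhash64_update(h: int, remove: int, add: int, window_size: int, table: list[int]) -> int:
    # Buzhash: rotate, explicitly cancel the outgoing byte, mix in the new one;
    # the window can be any size (borg's default is 4095 bytes).
    return rotl64(h, 1) ^ rotl64(table[remove], window_size & 63) ^ table[add]
```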

B) The chunking "quality" of the new chunker has now been investigated, see below.

codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 60.86957% with 9 lines in your changes missing coverage. Please review.

Project coverage is 82.03%. Comparing base (81bacd0) to head (d23704e).
Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
src/borg/helpers/parseformat.py 0.00% 8 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8903      +/-   ##
==========================================
+ Coverage   81.83%   82.03%   +0.19%     
==========================================
  Files          77       77              
  Lines       13466    13481      +15     
  Branches     1990     1995       +5     
==========================================
+ Hits        11020    11059      +39     
+ Misses       1773     1757      -16     
+ Partials      673      665       -8     


@ThomasWaldmann (Member, Author) commented Jun 5, 2025

Analysis script's output:

(borg-env) tw@MacBook-Pro borg % python3 scripts/chunker_comparison.py
================================================================================
BORG CHUNKER STATISTICAL ANALYSIS
================================================================================
Parameters:
  minexp=19 (min chunk size: 524288 bytes)
  maxexp=23 (max chunk size: 8388608 bytes)
  maskbits=21 (target avg chunk size: ~2097152 bytes)
  winsize=4095
--------------------------------------------------------------------------------
Generating 100.0MB of random data...
Analyzing chunkers...

Chunker Statistics:
Chunker: BuzHash
  Number of chunks: 53
  Min chunk size: 540517 bytes
  Max chunk size: 7077458 bytes
  Mean chunk size: 1978445.28 bytes
  Median chunk size: 1441271.00 bytes
  Standard deviation: 1551511.71 bytes
  Number of chunks at min size: 0 (0.00%)
  Number of chunks at max size: 0 (0.00%)

Chunker: BuzHash64e
  Number of chunks: 41
  Min chunk size: 545779 bytes
  Max chunk size: 6844073 bytes
  Mean chunk size: 2557502.44 bytes
  Median chunk size: 2500295.00 bytes
  Standard deviation: 1789725.65 bytes
  Number of chunks at min size: 0 (0.00%)
  Number of chunks at max size: 0 (0.00%)

Comparison:
  BuzHash64/BuzHash chunk count ratio: 0.77
  BuzHash64/BuzHash mean chunk size ratio: 1.29
  BuzHash64/BuzHash std dev ratio: 1.15

Chunk Size Distribution (power-of-2 buckets):
  Size Bucket | BuzHash Count (%) | BuzHash64 Count (%)
  -----------|-------------------|------------------
     1048576 |    17 ( 32.1%) |    12 ( 29.3%)
     2097152 |    21 ( 39.6%) |     8 ( 19.5%)
     4194304 |     9 ( 17.0%) |    13 ( 31.7%)
     8388608 |     6 ( 11.3%) |     8 ( 19.5%)

Commits pushed to this branch (commit messages):

  • Added some *64*.* files that are just 1:1 copies of their 32bit counterparts, so that the changes for the 64bit adaptation will later be better visible.

  • The buzhash seed only has 32bits, but we rather want 64bits for buzhash64. Just take them from crypt_key for now. That way we can feed lots of entropy into the table creation.

  • The bh64_key is derived from the id_key (NOT the crypt_key), thus it will create the same key for related repositories (even if they use different encryption/authentication keys). Due to that, it will also create the same buzhash64 table, will cut chunks at the same points, and deduplication will work among the related repositories.
@ThomasWaldmann (Member, Author) commented Jun 10, 2025

Hmm, currently it is using AES-CTR mode, but I guess that's wrong because it likely destroys dedup between similar files (e.g. if one has some bytes inserted at the beginning: because the offsets of all later data are shifted, it would get different encryption tokens from the PRF stream and thus, for the same buzhash value, would decide differently whether to chunk or not).

Maybe that whole PRF stream approach is even wrong and I need to go back to just using AES-ECB on each hash value (which is 5x slower due to EVP API call overhead).
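For illustration only (this code is not in the PR, and the feature was dropped from it, see the update below): masking each hash value with AES-ECB under a fixed key depends only on the hash value itself, not on the byte offset, so unlike a position-dependent CTR keystream it does not disturb dedup when data is shifted. A sketch assuming the Python "cryptography" package:

```python
# Hypothetical sketch, assuming the "cryptography" package; not borg code.
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def ecb_mask(key: bytes, hash_value: int) -> int:
    # Encrypt the 64-bit hash value (padded to one 16-byte AES block) under a
    # fixed key; the result depends only on the hash value, not on the offset.
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    block = hash_value.to_bytes(8, "little") + bytes(8)
    return int.from_bytes(encryptor.update(block)[:8], "little")
```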

Update: I removed the AES buzhash encryption from this PR, will continue working on that in a separate PR.

@ThomasWaldmann merged commit 9a65d52 into borgbackup:master on Jun 11, 2025 (15 of 16 checks passed).
@ThomasWaldmann deleted the buzhash64 branch on June 11, 2025 at 06:31.
# (1) the hash is designed for inputs <= 64 bytes, but the chunker uses it on a 4095 byte window;
# any repeating bytes at distance 64 within those 4095 bytes can cause cancellation within
# the hash function, e.g. in "X <any 63 bytes> X", the last X would cancel out the influence
# of the first X on the hash value.
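A tiny self-contained check of point (1), under the assumptions of the comment above: two equal bytes 64 positions apart land on the same rotation amount, so their table entries XOR-cancel and changing both of them together leaves the 64-bit hash unchanged.

```python
import os

MASK64 = (1 << 64) - 1

def buzhash64(data: bytes, table: list[int]) -> int:
    h = 0
    for b in data:
        h = ((h << 1) | (h >> 63)) & MASK64   # rotate left by 1
        h ^= table[b]
    return h

table = [int.from_bytes(os.urandom(8), "little") for _ in range(256)]
window = bytearray(os.urandom(4095))
window[100] = window[164] = ord("X")          # two equal bytes, 64 apart
other = bytearray(window)
other[100] = other[164] = ord("Y")            # replace both with another byte
assert buzhash64(bytes(window), table) == buzhash64(bytes(other), table)
```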
@darkk commented Jun 14, 2025

I'd like to add that the following comment also applies to buzhash64:

# (2) the hash table is supposed to have (according to the BUZ) exactly a 50% distribution of
#     0/1 bit values per position, but the generated table below doesn't fit that property.

Truncated SHA-256 does not guarantee bit-balance.

While it's possible to achieve bit-balance with additional logic — such as seeding a CSPRNG with a key and selecting 128 of 256 bits to be 1 (e.g. using a Fisher–Yates shuffle directly or vendoring some version of random.sample()) — I'm not convinced that such a measure is necessary to produce a good hash. More importantly, it is not sufficient on its own to ensure good hash properties.
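A hedged Python sketch of that idea (random.Random stands in here for a keyed CSPRNG; it is not one): for each of the 64 bit positions, set that bit in exactly 128 of the 256 table entries, giving a perfect 50/50 balance per position.

```python
import random

def balanced_table(key: bytes) -> list[int]:
    rng = random.Random(key)          # placeholder seeding, NOT a CSPRNG
    table = [0] * 256
    for bit in range(64):
        # choose exactly 128 of the 256 entries to have a 1 at this bit position
        for idx in rng.sample(range(256), 128):
            table[idx] |= 1 << bit
    return table

assert all(sum((e >> b) & 1 for e in balanced_table(b"key")) == 128 for b in range(64))
```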

For instance, there is an interesting corner case.

Robert Uzgalis originally defined Buzhash() as a 31-bit function ([BUZ]) for reasons unclear to me. I was unable to find the original publications ([GEN], [MYTHS]) that were referenced by Uzgalis himself ([BH]), but it appears this design choice may be related to the signedness of the result rather than any inherent property of the function. John Hamer ([JH]) also implemented it as a 31-bit function, though again, this might be due to Java's use of signed integers.

References:

Interestingly, it is possible to construct substitution tables that satisfy the bit-balance requirement while effectively reducing Buzhash32 down to a 31-bit function. Specifically, if popcount(table[c]) is even for every byte c, then popcount(Buzhash(x)) will also be even for any input. This follows trivially from the fact that both rot() and xor() preserve the parity of the popcount.

Similarly, if popcount(table[c]) is odd for every byte c, then the parity of popcount(Buzhash(x)) will depend on the window size.
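A tiny numeric check of the parity argument (both operations used by Buzhash preserve the parity of the popcount):

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & ((1 << 64) - 1)

a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
assert parity(a ^ b) == parity(a) ^ parity(b)   # XOR adds parities (mod 2)
assert parity(rotl64(a, 13)) == parity(a)       # rotation preserves popcount
```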

However, generating an "Unlucky" table of that kind using SHA(i + key) is really improbable.

So, while strict 50% bit distribution might help reduce bias, it is not sufficient to make Buzhash equidistributed. Also, I’m not aware of a simple way to measure that bias or to prove its presence with confidence.

@ThomasWaldmann (Member, Author):

Good catch, I missed that pseudo-randomness of course does not guarantee precise 50/50 bit distribution per bit position.

About the 31bit stuff: OFC, I don't know what they were thinking, but I think it was probably to avoid signedness issues.

About reducing it accidentally to a 31bit function: guess the probability for the "even case" is 1 : 2^256 and thus we can safely ignore it? Same for the "odd case".

I removed the sha256 from the code in my new PR. The key fed to the table init function is derived from a 256bit_random_secret.


@darkk:

> guess the probability for the "even case" is 1 : 2^256

That's right. However, my concern lies in a different area. Buzhash is used in several projects, yet I've not found a single publication mentioning such "unlucky" tables as a possible problem of the hash function. That makes me think this specific hash function has not been well studied and might have more pitfalls of higher probability.

I mean, I'm not worried about the 1:2^256 case. I'm worried about something unknown, similar to how "window % 64 = 0 is problematic for the xor-seed case" turned into "every even window is problematic" (#8868).

> #8924

Yes, I think it's the best (known) way to go for Buzhash tables.

Thanks for your work on Borg! It's truly awesome.

cdef uint64_t _buzhash64_update(uint64_t sum, unsigned char remove, unsigned char add, size_t len, const uint64_t* h):
"""Update the buzhash with a new byte."""
cdef uint64_t lenmod = len & 0x3f
return BARREL_SHIFT64(sum, 1) ^ BARREL_SHIFT64(h[remove], lenmod) ^ h[add]
@darkk:

I'd also like to mention a few Buzhash-related optimizations that might help improve chunker performance.

1. Inlining _buzhash_update()

Buzhash is quite fast, even as a one-byte-at-a-time function — it measured at around 2.5 cycles/byte in my benchmarks. As a result, the overhead of calling _buzhash_update() can be comparable to the cost of the actual computation. For this reason, it's important that the compiler can inline _buzhash_update() inside the tight while ... loop.

This is typically achieved by declaring the function with static linkage. If I understand generate_cfunction_declaration() from Cython correctly, that should already be the case. However, I noticed you've defined BARREL_SHIFT64() as a macro rather than a static function, which suggests you might already be aware of possible inlining pitfalls — hence why I wanted to raise this just in case.

2. Using a Precomputed Rotated Table

You might also consider trading 2 KiB of memory for a potential speedup.

Since ChunkerBuzHash64() takes window_size as a parameter, you could precompute BARREL_SHIFT64(x, lenmod) for each value, doubling the size of the lookup table to 2 * 256 * sizeof(uint64_t) bytes. If memory serves me well, this optimization gave me an ≈25% speedup.

One more detail: I found that performance varied depending on how the rotated values were stored. Surprisingly, concatenating the original and rotated tables performed better in my tests than interleaving them, though that result was a bit counterintuitive for me.
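A hedged Python sketch of the doubled-table layout (the real speedup would come from doing this in the C/Cython hot loop, not in Python): entries 0..255 hold table[c], entries 256..511 hold table[c] pre-rotated by window_size % 64, so the update no longer rotates the outgoing byte's entry.

```python
MASK64 = (1 << 64) - 1

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64

def make_double_table(table: list[int], window_size: int) -> list[int]:
    lenmod = window_size & 0x3F
    # concatenated layout: original entries first, pre-rotated entries second
    return table + [rotl64(t, lenmod) for t in table]

def update(h: int, remove: int, add: int, dtable: list[int]) -> int:
    # equivalent to the _buzhash64_update() above, without rotating h[remove]
    return rotl64(h, 1) ^ dtable[256 + remove] ^ dtable[add]
```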

3. Using a Sentinel to Eliminate Bounds Checks

There’s a more aggressive optimization that avoids bounds checking on each while ... loop iteration.

By appending a sentinel value at the end of the buffer, you can guarantee that a cut point will always be found — either in the actual data, within the sentinel or at the very end of it. Specifically, the sentinel should consist of window_size bytes that produce a Buzhash() result with hash & chunk_mask == 0.

This optimization resulted in an improvement of approximately 0.25 cycles/byte, translating to a ≈10% speedup. That said, YMMV — its effectiveness likely depends on CPU pipelining and the relative sizes of buf_size (1 << chunk_max_exp), window_size, and the CPU's Last-Level Cache (LLC). For example, a larger buf_size means more bounds checks are avoided, while a larger window_size increases the overhead of computing the Buzhash of sentinel. If buf_size exceeds the LLC, you may also encounter RAM-related stalls.

I haven't found an easy way to compute a sentinel directly from the Buzhash table. The options I used were either to memoize the first valid cut point when it occurs, or to generate a random stream of bytes and rely on probability to do its job. The resulting sentinel can then be stored either in the ChunkerBuzHash64() object at runtime or as part of the repository metadata.
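A slow, purely illustrative Python sketch of the "random stream" approach to finding a sentinel (shown with a deliberately small mask so it finishes quickly; real maskbits like 21 would need far more iterations):

```python
import os

MASK64 = (1 << 64) - 1

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64

def find_sentinel(table: list[int], window_size: int, chunk_mask: int) -> bytes:
    buf = bytearray(os.urandom(window_size))
    h = 0
    for b in buf:                              # hash of the initial window
        h = rotl64(h, 1) ^ table[b]
    lenmod = window_size & 0x3F
    while h & chunk_mask != 0:                 # roll until the cut condition holds
        new = os.urandom(1)[0]
        h = rotl64(h, 1) ^ rotl64(table[buf[-window_size]], lenmod) ^ table[new]
        buf.append(new)
    return bytes(buf[-window_size:])

# Example with a tiny mask so the probabilistic search terminates fast:
table = [int.from_bytes(os.urandom(8), "little") for _ in range(256)]
sentinel = find_sentinel(table, window_size=4095, chunk_mask=(1 << 8) - 1)
```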

For full transparency: my experiments were done in plain C11, and I don’t have much experience with Cython. Please feel free to disregard any of these suggestions if they don't apply well to your codebase.

@ThomasWaldmann (Member, Author):

Thanks for your insightful comments, very much appreciated!

@ThomasWaldmann (Member, Author):

I tried 2 things:

  • using cdef inline ... for the _buzhash_update function
  • manually inlining it into the loop

I didn't see a notable difference on my machine (running the borg benchmark cpu chunker tests for ~10s). Maybe the C compiler is clever enough to already do that.

Or the M3P cpu is very efficient when calling small functions.

@ThomasWaldmann (Member, Author):

I also tried pre-computing the first BARREL_SHIFT macro parameter so it does not get "evaluated" twice due to macro expansion. That didn't make a difference either.

@darkk:

> This is typically achieved by declaring the function with static linkage. If I understand generate_cfunction_declaration() from Cython correctly, that should already be the case

Yep, I've verified that. That's the case at least with Cython version 0.29.28. The resulting __pyx_f_4borg_8chunkers_9buzhash64__buzhash64_update() function is static. Turns out, getting C code is as simple as cython3 buzhash64.pyx :-)

@ThomasWaldmann (Member, Author):

Biggest win (about 15-20% faster) was 8-fold loop unrolling and 8-fold table value prefetching to improve memory access patterns.

But I don't really like it; it bloats the source code quite a bit.

@darkk:

Thanks for sharing! It's an interesting outcome to know. I've also tried unrolling & aligned prefetching, but gained nothing measurable.

@ThomasWaldmann (Member, Author):

I didn't even enforce alignment, just prefetching 8 data bytes / 8 table entries directly after each other.

Maybe ARM cpus behave a bit differently here than Intel/AMD.
