
Conversation

@ThomasWaldmann (Member) commented Jun 5, 2025

buzhash table creation

Instead of replacing the table_base of pre-computed 32bit "random" values with one of pre-computed 64bit "random" values, I decided to rather drop it from the source code and deterministically generate the table using sha256(i, key). Instead of the previous seed, it now uses a key that is derived from the ID key.
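For illustration, a minimal Python sketch of one way such a table derivation could look (the exact digest packing, byte order, and key handling in the actual code may differ):

```python
import hashlib
import struct

def buzhash64_table(key: bytes) -> list[int]:
    # One 64-bit entry per byte value: take the first 8 bytes of
    # sha256(i || key). Purely illustrative; not the exact borg code.
    table = []
    for i in range(256):
        digest = hashlib.sha256(bytes([i]) + key).digest()
        table.append(struct.unpack("<Q", digest[:8])[0])
    return table
```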

That should also resolve some points of criticism about the old buzhash 32bit code:

  • table_base: the bits are not randomly distributed enough.
  • an XORed seed cancels out for specific window sizes.
  • XORing the table with a seed is equivalent to XORing the computed hash value with another constant (see the sketch after this list).
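A small, self-contained Python sketch of the algebra behind the last two points (illustrative only, not borg code): because rotation distributes over XOR, a seed XORed into every table entry only XORs the final hash with a constant that depends on the window size, not on the data, and for some window sizes that constant is zero, so the seed cancels entirely.

```python
import os

def rotl(x: int, n: int, bits: int) -> int:
    n %= bits
    return ((x << n) | (x >> (bits - n))) & ((1 << bits) - 1)

def buzhash(data: bytes, table: list[int], bits: int) -> int:
    h = 0
    for b in data:
        h = rotl(h, 1, bits) ^ table[b]
    return h

base = [int.from_bytes(os.urandom(4), "little") for _ in range(256)]
seed = 0x9E3779B9
seeded = [t ^ seed for t in base]            # table_base XORed with a seed

w1, w2 = os.urandom(4095), os.urandom(4095)
d1 = buzhash(w1, base, 32) ^ buzhash(w1, seeded, 32)
d2 = buzhash(w2, base, 32) ^ buzhash(w2, seeded, 32)
assert d1 == d2                              # the seed only adds a data-independent constant

w3 = os.urandom(64)                          # window size a multiple of 64
assert buzhash(w3, base, 32) == buzhash(w3, seeded, 32)   # here the seed cancels completely
```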

Sizes

key / seed: now 256bits, was 32bits.
buzhash: now 64bits, was 32bits.

Performance

buzhash,19,23,21,4095          1GB        0.884s  # 32bit
buzhash64,19,23,21,4095        1GB        0.909s  # 64bit

(measurements from Apple MBP M3P)

Security?

I guess these changes make fingerprinting attacks harder - some independent review of this would be very welcome.

See also: https://github.com/borgbackup/borg/wiki/CDC-issues-reported-2025

The first paper, section 3.5 (the compression-enabled case), suggests using a 64bit hash (if I understood it correctly).

The second paper suggests doing something like AES(buzhash_value) before making the chunking decision (i.e. for every byte position in a file). That was deferred to a future PR.

@ThomasWaldmann (Member, Author) commented Jun 5, 2025

Some notes:

A) Before working on buzhash64, I had a look at fastCDC and gear hashing.

But because gear just does a simple bit shift (not a bit rotate, as buzhash does), it can only work with window sizes of up to 64 bytes. My gut feeling was that this limits the configurability of the context size for a "chunking point" compared to borg's buzhash, which can be configured over a much wider range (default: 4095 bytes).
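To illustrate the difference, a hedged sketch (not code from either project): gear shifts old bits out of the word, so a byte's direct influence is gone after 64 update steps, while buzhash rotates and explicitly removes the byte that leaves the window, so the window size is free to be much larger.

```python
MASK64 = (1 << 64) - 1

def gear_update(h: int, add: int, table: list[int]) -> int:
    # FastCDC-style gear: old bytes' bits are shifted out of the top,
    # so only roughly the last 64 bytes influence the hash.
    return ((h << 1) + table[add]) & MASK64

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64

def buzhash64_update(h: int, remove: int, add: int, window_size: int, table: list[int]) -> int:
    # Buzhash: rotate, explicitly cancel the outgoing byte, mix in the new one;
    # the window can be any size (borg's default is 4095 bytes).
    return rotl64(h, 1) ^ rotl64(table[remove], window_size & 63) ^ table[add]
```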

B) The chunking "quality" of the new chunker has now been investigated, see below.

codecov bot commented Jun 5, 2025

Codecov Report

Attention: Patch coverage is 60.86957% with 9 lines in your changes missing coverage. Please review.

Project coverage is 82.03%. Comparing base (81bacd0) to head (d23704e).
Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
src/borg/helpers/parseformat.py 0.00% 8 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8903      +/-   ##
==========================================
+ Coverage   81.83%   82.03%   +0.19%     
==========================================
  Files          77       77              
  Lines       13466    13481      +15     
  Branches     1990     1995       +5     
==========================================
+ Hits        11020    11059      +39     
+ Misses       1773     1757      -16     
+ Partials      673      665       -8     


@ThomasWaldmann (Member, Author) commented Jun 5, 2025

Analysis script's output:

(borg-env) tw@MacBook-Pro borg % python3 scripts/chunker_comparison.py
================================================================================
BORG CHUNKER STATISTICAL ANALYSIS
================================================================================
Parameters:
  minexp=19 (min chunk size: 524288 bytes)
  maxexp=23 (max chunk size: 8388608 bytes)
  maskbits=21 (target avg chunk size: ~2097152 bytes)
  winsize=4095
--------------------------------------------------------------------------------
Generating 100.0MB of random data...
Analyzing chunkers...

Chunker Statistics:
Chunker: BuzHash
  Number of chunks: 53
  Min chunk size: 540517 bytes
  Max chunk size: 7077458 bytes
  Mean chunk size: 1978445.28 bytes
  Median chunk size: 1441271.00 bytes
  Standard deviation: 1551511.71 bytes
  Number of chunks at min size: 0 (0.00%)
  Number of chunks at max size: 0 (0.00%)

Chunker: BuzHash64e
  Number of chunks: 41
  Min chunk size: 545779 bytes
  Max chunk size: 6844073 bytes
  Mean chunk size: 2557502.44 bytes
  Median chunk size: 2500295.00 bytes
  Standard deviation: 1789725.65 bytes
  Number of chunks at min size: 0 (0.00%)
  Number of chunks at max size: 0 (0.00%)

Comparison:
  BuzHash64/BuzHash chunk count ratio: 0.77
  BuzHash64/BuzHash mean chunk size ratio: 1.29
  BuzHash64/BuzHash std dev ratio: 1.15

Chunk Size Distribution (power-of-2 buckets):
  Size Bucket | BuzHash Count (%) | BuzHash64 Count (%)
  -----------|-------------------|------------------
     1048576 |    17 ( 32.1%) |    12 ( 29.3%)
     2097152 |    21 ( 39.6%) |     8 ( 19.5%)
     4194304 |     9 ( 17.0%) |    13 ( 31.7%)
     8388608 |     6 ( 11.3%) |     8 ( 19.5%)

Commits pushed to this branch (commit messages):

  • Added some *64*.* files that are just 1:1 copies of their 32bit counterparts, so that the changes for the 64bit adaptation will later be better visible.

  • The buzhash seed only has 32bits, but we rather want 64bits for buzhash64. Just take them from crypt_key for now. That way we can feed lots of entropy into the table creation.

  • The bh64_key is derived from the id_key (NOT the crypt_key), thus it will create the same key for related repositories (even if they use different encryption/authentication keys). Due to that, it will also create the same buzhash64 table, will cut chunks at the same points, and deduplication will work among the related repositories.
@ThomasWaldmann (Member, Author) commented Jun 10, 2025

Hmm, currently it is using AES-CTR mode, but I guess that's wrong because it likely destroys dedup between similar files (e.g. if one has some bytes inserted at the beginning: because the offsets of all later data are shifted, it would get different encryption tokens from the PRF stream and thus, for the same buzhash value, would decide differently whether to chunk or not).

Maybe that whole PRF stream approach is even wrong and I need to go back to just using AES-ECB on each hash value (which is 5x slower due to EVP API call overhead).
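For illustration only (this code is not in the PR, and the feature was dropped from it, see the update below): masking each hash value with AES-ECB under a fixed key depends only on the hash value itself, not on the byte offset, so unlike a position-dependent CTR keystream it does not disturb dedup when data is shifted. A sketch assuming the Python "cryptography" package:

```python
# Hypothetical sketch, assuming the "cryptography" package; not borg code.
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def ecb_mask(key: bytes, hash_value: int) -> int:
    # Encrypt the 64-bit hash value (padded to one 16-byte AES block) under a
    # fixed key; the result depends only on the hash value, not on the offset.
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    block = hash_value.to_bytes(8, "little") + bytes(8)
    return int.from_bytes(encryptor.update(block)[:8], "little")
```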

Update: I removed the AES buzhash encryption from this PR, will continue working on that in a separate PR.

@ThomasWaldmann merged commit 9a65d52 into borgbackup:master on Jun 11, 2025 (15 of 16 checks passed).
@ThomasWaldmann deleted the buzhash64 branch on June 11, 2025 at 06:31.
# (1) the hash is designed for inputs <= 64 bytes, but the chunker uses it on a 4095 byte window;
# any repeating bytes at distance 64 within those 4095 bytes can cause cancellation within
# the hash function, e.g. in "X <any 63 bytes> X", the last X would cancel out the influence
# of the first X on the hash value.
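A tiny self-contained check of point (1), under the assumptions of the comment above: two equal bytes 64 positions apart land on the same rotation amount, so their table entries XOR-cancel and changing both of them together leaves the 64-bit hash unchanged.

```python
import os

MASK64 = (1 << 64) - 1

def buzhash64(data: bytes, table: list[int]) -> int:
    h = 0
    for b in data:
        h = ((h << 1) | (h >> 63)) & MASK64   # rotate left by 1
        h ^= table[b]
    return h

table = [int.from_bytes(os.urandom(8), "little") for _ in range(256)]
window = bytearray(os.urandom(4095))
window[100] = window[164] = ord("X")          # two equal bytes, 64 apart
other = bytearray(window)
other[100] = other[164] = ord("Y")            # replace both with another byte
assert buzhash64(bytes(window), table) == buzhash64(bytes(other), table)
```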
@darkk commented Jun 14, 2025

I'd like to add that the following comment also applies to buzhash64:

# (2) the hash table is supposed to have (according to the BUZ) exactly a 50% distribution of
#     0/1 bit values per position, but the generated table below doesn't fit that property.

Truncated SHA-256 does not guarantee bit-balance.

While it's possible to achieve bit-balance with additional logic — such as seeding a CSPRNG with a key and selecting 128 of 256 bits to be 1 (e.g. using a Fisher–Yates shuffle directly or vendoring some version of random.sample()) — I'm not convinced that such a measure is necessary to produce a good hash. More importantly, it is not sufficient on its own to ensure good hash properties.
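A hedged Python sketch of that idea (random.Random stands in here for a keyed CSPRNG; it is not one): for each of the 64 bit positions, set that bit in exactly 128 of the 256 table entries, giving a perfect 50/50 balance per position.

```python
import random

def balanced_table(key: bytes) -> list[int]:
    rng = random.Random(key)          # placeholder seeding, NOT a CSPRNG
    table = [0] * 256
    for bit in range(64):
        # choose exactly 128 of the 256 entries to have a 1 at this bit position
        for idx in rng.sample(range(256), 128):
            table[idx] |= 1 << bit
    return table

assert all(sum((e >> b) & 1 for e in balanced_table(b"key")) == 128 for b in range(64))
```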

For instance, there is an interesting corner case.

Robert Uzgalis originally defined Buzhash() as a 31-bit function ([BUZ]) for reasons unclear to me. I was unable to find the original publications ([GEN], [MYTHS]) that were referenced by Uzgalis himself ([BH]), but it appears this design choice may be related to the signedness of the result rather than any inherent property of the function. John Hamer ([JH]) also implemented it as a 31-bit function, though again, this might be due to Java's use of signed integers.

References:

Interestingly, it is possible to construct substitution tables that satisfy the bit-balance requirement while effectively reducing Buzhash32 down to a 31-bit function. Specifically, if popcount(table[c]) is even for every byte c, then popcount(Buzhash(x)) will also be even for any input. This follows trivially from the fact that both rot() and xor() preserve the parity of the popcount.

Similarly, if popcount(table[c]) is odd for every byte c, then the parity of popcount(Buzhash(x)) will depend on the window size.
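A tiny numeric check of the parity argument (both operations used by Buzhash preserve the parity of the popcount):

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & ((1 << 64) - 1)

a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
assert parity(a ^ b) == parity(a) ^ parity(b)   # XOR adds parities (mod 2)
assert parity(rotl64(a, 13)) == parity(a)       # rotation preserves popcount
```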

However, generating an "Unlucky" table of that kind using SHA(i + key) is really improbable.

So, while strict 50% bit distribution might help reduce bias, it is not sufficient to make Buzhash equidistributed. Also, I’m not aware of a simple way to measure that bias or to prove its presence with confidence.

@ThomasWaldmann (Member, Author):

Good catch, I missed that pseudo-randomness of course does not guarantee precise 50/50 bit distribution per bit position.

About the 31bit stuff: OFC, I don't know what they were thinking, but I think it was probably to avoid signedness issues.

About reducing it accidentally to a 31bit function: guess the probability for the "even case" is 1 : 2^256 and thus we can safely ignore it? Same for the "odd case".

I removed the sha256 from the code in my new PR. The key fed to the table init function is derived from a 256bit_random_secret.


@darkk:

> guess the probability for the "even case" is 1 : 2^256

That's right. However, my concern lies in a different area. Buzhash is used in several projects, yet I've not found a single publication mentioning such "unlucky" tables as a possible problem of the hash function. That makes me think this specific hash function has not been well studied and might have more pitfalls of higher probability.

I mean, I'm not worried about the 1:2^256 case. I'm worried about something unknown, similar to how "window % 64 = 0 is problematic for the xor-seed case" turned into "every even window is problematic" (#8868).

> #8924

Yes, I think it's the best (known) way to go for Buzhash tables.

Thanks for your work on Borg! It's truly awesome.

cdef uint64_t _buzhash64_update(uint64_t sum, unsigned char remove, unsigned char add, size_t len, const uint64_t* h):
"""Update the buzhash with a new byte."""
cdef uint64_t lenmod = len & 0x3f
return BARREL_SHIFT64(sum, 1) ^ BARREL_SHIFT64(h[remove], lenmod) ^ h[add]
@darkk:

I'd also like to mention a few Buzhash-related optimizations that might help improve chunker performance.

1. Inlining _buzhash_update()

Buzhash is quite fast, even as a one-byte-at-a-time function — it measured at around 2.5 cycles/byte in my benchmarks. As a result, the overhead of calling _buzhash_update() can be comparable to the cost of the actual computation. For this reason, it's important that the compiler can inline _buzhash_update() inside the tight while ... loop.

This is typically achieved by declaring the function with static linkage. If I understand generate_cfunction_declaration() from Cython correctly, that should already be the case. However, I noticed you've defined BARREL_SHIFT64() as a macro rather than a static function, which suggests you might already be aware of possible inlining pitfalls — hence why I wanted to raise this just in case.

2. Using a Precomputed Rotated Table

You might also consider trading 2 KiB of memory for a potential speedup.

Since ChunkerBuzHash64() takes window_size as a parameter, you could precompute BARREL_SHIFT64(x, lenmod) for each value, doubling the size of the lookup table to 2 * 256 * sizeof(uint64_t) bytes. If memory serves me well, this optimization gave me an ≈25% speedup.

One more detail: I found that performance varied depending on how the rotated values were stored. Surprisingly, concatenating the original and rotated tables performed better in my tests than interleaving them, though that result was a bit counterintuitive for me.
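A hedged Python sketch of the doubled-table layout (the real speedup would come from doing this in the C/Cython hot loop, not in Python): entries 0..255 hold table[c], entries 256..511 hold table[c] pre-rotated by window_size % 64, so the update no longer rotates the outgoing byte's entry.

```python
MASK64 = (1 << 64) - 1

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64

def make_double_table(table: list[int], window_size: int) -> list[int]:
    lenmod = window_size & 0x3F
    # concatenated layout: original entries first, pre-rotated entries second
    return table + [rotl64(t, lenmod) for t in table]

def update(h: int, remove: int, add: int, dtable: list[int]) -> int:
    # equivalent to the _buzhash64_update() above, without rotating h[remove]
    return rotl64(h, 1) ^ dtable[256 + remove] ^ dtable[add]
```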

3. Using a Sentinel to Eliminate Bounds Checks

There’s a more aggressive optimization that avoids bounds checking on each while ... loop iteration.

By appending a sentinel value at the end of the buffer, you can guarantee that a cut point will always be found — either in the actual data, within the sentinel or at the very end of it. Specifically, the sentinel should consist of window_size bytes that produce a Buzhash() result with hash & chunk_mask == 0.

This optimization resulted in an improvement of approximately 0.25 cycles/byte, translating to a ≈10% speedup. That said, YMMV — its effectiveness likely depends on CPU pipelining and the relative sizes of buf_size (1 << chunk_max_exp), window_size, and the CPU's Last-Level Cache (LLC). For example, a larger buf_size means more bounds checks are avoided, while a larger window_size increases the overhead of computing the Buzhash of sentinel. If buf_size exceeds the LLC, you may also encounter RAM-related stalls.

I haven't found an easy way to compute a sentinel directly from the Buzhash table. The options I used were either to memoize the first valid cut point when it occurs, or to generate a random stream of bytes and rely on probability to do its job. The resulting sentinel can then be stored either in the ChunkerBuzHash64() object at runtime or as part of the repository metadata.
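A slow, purely illustrative Python sketch of the "random stream" approach to finding a sentinel (shown with a deliberately small mask so it finishes quickly; real maskbits like 21 would need far more iterations):

```python
import os

MASK64 = (1 << 64) - 1

def rotl64(x: int, n: int) -> int:
    n &= 63
    return ((x << n) | (x >> (64 - n))) & MASK64

def find_sentinel(table: list[int], window_size: int, chunk_mask: int) -> bytes:
    buf = bytearray(os.urandom(window_size))
    h = 0
    for b in buf:                              # hash of the initial window
        h = rotl64(h, 1) ^ table[b]
    lenmod = window_size & 0x3F
    while h & chunk_mask != 0:                 # roll until the cut condition holds
        new = os.urandom(1)[0]
        h = rotl64(h, 1) ^ rotl64(table[buf[-window_size]], lenmod) ^ table[new]
        buf.append(new)
    return bytes(buf[-window_size:])

# Example with a tiny mask so the probabilistic search terminates fast:
table = [int.from_bytes(os.urandom(8), "little") for _ in range(256)]
sentinel = find_sentinel(table, window_size=4095, chunk_mask=(1 << 8) - 1)
```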

For full transparency: my experiments were done in plain C11, and I don’t have much experience with Cython. Please feel free to disregard any of these suggestions if they don't apply well to your codebase.

@ThomasWaldmann (Member, Author):

Thanks for your insightful comments, very much appreciated!

@ThomasWaldmann (Member, Author):

I tried 2 things:

  • using cdef inline ... for the _buzhash_update function
  • manually inlining it into the loop

I didn't see a notable difference on my machine (running the borg benchmark cpu chunker tests for ~10s). Maybe the C compiler is clever enough to already do that.

Or the M3P cpu is very efficient when calling small functions.

@ThomasWaldmann (Member, Author):

I also tried pre-computing the first BARREL_SHIFT macro parameter so it does not get "evaluated" twice due to macro expansion. That didn't make a difference either.

@darkk:

> This is typically achieved by declaring the function with static linkage. If I understand generate_cfunction_declaration() from Cython correctly, that should already be the case

Yep, I've verified that. That's the case at least with Cython version 0.29.28. The resulting __pyx_f_4borg_8chunkers_9buzhash64__buzhash64_update() function is static. Turns out, getting C code is as simple as cython3 buzhash64.pyx :-)

@ThomasWaldmann (Member, Author):

Biggest win (about 15-20% faster) was 8-fold loop unrolling and 8-fold table value prefetching to improve memory access patterns.

But I don't really like it; it bloats the source code quite a bit.

@darkk:

Thanks for sharing! It's an interesting outcome to know. I've also tried unrolling & aligned prefetching, but gained nothing measurable.

@ThomasWaldmann (Member, Author):

I didn't even enforce alignment, just prefetching 8 data bytes / 8 table entries directly after each other.

Maybe ARM cpus behave a bit differently here than Intel/AMD.
