Replace SHA-512 with vendored portable BLAKE3 for chunk addresses #667
Profile showed sha512_transform as the second-biggest CPU hotspot
on write workloads (after the now-fixed mutmap qsort). prollyHashCompute
hashes every emitted chunk and is hot on every commit.
Microbench on a 4 KB input (representative chunk size):
  Implementation                  ns/op     MB/s   Speedup
  SHA-512 (existing)             13,380      306     1.00x
  vendored BLAKE3 (portable)      5,094      804     2.63x
  libblake3 (SIMD-accelerated)    2,441    1,678     5.48x
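
The MB/s and speedup columns follow directly from the ns/op column. A minimal sketch of that conversion, assuming decimal megabytes (10^6 bytes) and a 4096-byte input; the function names are illustrative and not part of doltlite:

```c
#include <assert.h>
#include <stddef.h>

/* Convert a per-operation latency into throughput in decimal MB/s
 * (1 MB = 10^6 bytes). Assumes each operation hashes bytes_per_op bytes. */
static double throughput_mb_per_s(size_t bytes_per_op, double ns_per_op) {
    assert(ns_per_op > 0.0);
    /* bytes per nanosecond equals GB/s, so scale by 1000 to get MB/s. */
    return (double)bytes_per_op / ns_per_op * 1000.0;
}

/* Speedup of a candidate implementation relative to a baseline latency. */
static double speedup(double baseline_ns, double candidate_ns) {
    assert(candidate_ns > 0.0);
    return baseline_ns / candidate_ns;
}
```

For example, `throughput_mb_per_s(4096, 5094.0)` gives roughly 804 MB/s and `speedup(13380.0, 5094.0)` gives roughly 2.63, in line with the portable BLAKE3 row.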
End-to-end timing on a 200K-row insert workload (10 runs):
master (SHA-512): median 1.170s mean 1.182s
branch (BLAKE3): median 0.965s mean 0.984s
Δ: 17.5% faster (median), 16.8% (mean)
Math reconciles: hashing was ~30% of write CPU; replacing it with
something 2.6x faster cuts that to ~12%, total work drops by ~18%.
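
The reconciliation is an Amdahl's-law estimate. Written out, with f the fraction of write CPU spent hashing (~0.30) and s the hash speedup (2.63), both taken from the figures above:

```latex
\[
\frac{T_{\text{new}}}{T_{\text{old}}}
  = (1 - f) + \frac{f}{s}
  = 0.70 + \frac{0.30}{2.63}
  \approx 0.70 + 0.114
  = 0.814
\]
```

Hashing shrinks from 30% to about 11-12% of the original CPU budget, and total work drops by about 18.6%, close to the measured 17.5% median improvement.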
Why portable, not SIMD:
- WASM/iOS/Android targets: no SSE2/AVX/NEON intrinsics in the
portable version, so a single source set compiles unchanged on
every target. SIMD paths are a follow-up: they can be added
behind compile-time flags without changing the format.
- Even the portable path beats SHA-512 by 2.6x. The SIMD ceiling
would be another 2x on top.
Why BLAKE3 specifically:
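- 256-bit digest (we truncate to 20 bytes, same as SHA-512 was
truncated). Truncation is safe because any prefix of BLAKE3's
output is itself uniformly distributed; collision resistance is
then bounded by the 160-bit address (about 2^80 work), unchanged
from the truncated SHA-512 scheme.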
- Faster than SHA-512, faster than BLAKE2b, comparable security
properties to either for content addressing.
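
As a rough illustration of what a 160-bit address buys, the birthday bound gives the probability of any accidental collision among n distinct chunks. This assumes the truncated hash behaves like a random function; the value n = 2^40 (about a trillion chunks) is a hypothetical store size, not a doltlite figure:

```latex
\[
p_{\text{collision}} \approx \frac{n(n-1)}{2 \cdot 2^{160}} \approx \frac{n^{2}}{2^{161}},
\qquad
n = 2^{40} \;\Rightarrow\; p_{\text{collision}} \approx \frac{2^{80}}{2^{161}} = 2^{-81}
\]
```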
What's vendored:
- ext/blake3/blake3.h: public API (verbatim from upstream)
- ext/blake3/blake3_impl.h: internal types/macros, with the SIMD
  declarations stripped (we only use the portable path)
- ext/blake3/blake3.c: high-level hasher (verbatim)
- ext/blake3/blake3_portable.c: portable BLAKE3 round (verbatim)
- ext/blake3/blake3_dispatch_portable.c: minimal dispatch shim that
  calls the portable functions directly. Replaces upstream
  blake3_dispatch.c, which does runtime CPU feature detection.
  ~30 lines.
Upstream is ~1.8 (CC0/Apache-2.0 dual licensed), suitable for
vendoring under doltlite's Apache-2.0.
BREAKING — on-disk format:
CHUNK_STORE_VERSION 10 -> 11. Chunk content addresses change
(different hash function over the same bytes), so commit hashes
for the same logical data are not the same. Existing 0.10.x DBs
fail to open with the SQLITE_NOTADB + format-mismatch log message
the version-10 bump introduced.
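
The open-time behaviour described above amounts to a strict version comparison. A minimal sketch, not doltlite's actual code: the `EXAMPLE_`-prefixed names and the function are hypothetical, the log wording is invented for illustration, and only the numeric result codes (0 for SQLITE_OK, 26 for SQLITE_NOTADB) come from SQLite itself:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins so the sketch compiles without sqlite3.h.
 * The numeric values match SQLite's SQLITE_OK and SQLITE_NOTADB. */
#define EXAMPLE_SQLITE_OK            0
#define EXAMPLE_SQLITE_NOTADB        26
#define EXAMPLE_CHUNK_STORE_VERSION  11u  /* bumped from 10 by this change */

/* Reject any store whose on-disk version differs from the one this
 * build writes. There is no migration path in this sketch. */
static int example_check_format_version(uint32_t on_disk_version) {
    if (on_disk_version != EXAMPLE_CHUNK_STORE_VERSION) {
        fprintf(stderr, "chunk store format mismatch: file is v%u, expected v%u\n",
                (unsigned)on_disk_version,
                (unsigned)EXAMPLE_CHUNK_STORE_VERSION);
        return EXAMPLE_SQLITE_NOTADB;
    }
    return EXAMPLE_SQLITE_OK;
}
```

A strict equality check, rather than "version <= current", fits a change like this one: chunk addresses differ between versions, so an older store cannot be read even partially.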
The dead sha512_hash and sha512_transform functions are left in
place for now; clang's DCE strips them. Removal is a follow-up
cleanup.
Verified:
- vc_oracle_merge_test: 41/41
- vc_oracle_branch_test: 30/30
- vc_oracle_diff_test: 41/41
- chunk_distribution_test: 7/7
- 10K + 2K branch + merge end-to-end OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sysbench-Style Benchmark: Doltlite vs SQLite
[Result tables not preserved: In-Memory reads/writes, File-Backed reads/writes.]
10000 rows, single CLI invocation per test, workload-only timing via SQL timestamps.
Performance Ceiling Check (2x): All tests within ceilings.
Compliance + correctness scaffolding for the vendored BLAKE3 portable
reference. Splits into:
- ext/blake3/LICENSE — dual Apache 2.0 (with LLVM exception) and
CC0 1.0 license texts, the form upstream BLAKE3 ships under.
Required by Apache 2.0 §4(a) for redistribution. Includes
upstream copyright notice (Jack O'Connor and Samuel Neves, 2019)
and version pointer (1.8.5).
- ext/blake3/README.md — provenance: which upstream commit, what
files we vendored verbatim, what we modified (blake3_impl.h SIMD
decls stripped) and what we wrote ourselves
(blake3_dispatch_portable.c). Required by Apache 2.0 §4(b).
- SPDX-License-Identifier headers on all five files in ext/blake3/.
Upstream-derived files carry "Apache-2.0 WITH LLVM-exception OR
CC0-1.0"; the doltlite-original dispatch shim carries plain
Apache-2.0 with our own copyright.
- test/blake3_kat_test.sh — Known-Answer Test against the vendored
portable implementation. Six vectors:
* full 32-byte BLAKE3 of empty input (canonical spec vector)
* prollyHashCompute (20-byte truncation) of empty, "abc",
1024 B (1 chunk), 4096 B (4 chunks), 16384 B (16 chunks).
The chunk-multiple cases exercise BLAKE3's tree-mode where
compress_subtree_wide kicks in.
Reference values were computed against upstream libblake3
(the SIMD-accelerated build that's authoritative-by-construction)
and cross-checked against the canonical spec vector. All 6 pass
locally.
- .github/workflows/test.yml — wires the KAT test into the
build-and-test job so a future change accidentally breaking the
hash output (e.g. reckless modification of blake3_portable.c)
fails CI loudly.
- LICENSE.md — new section under "non-public-domain code" listing
the BLAKE3 vendor and pointing at ext/blake3/LICENSE.
No code changes to the BLAKE3 implementation itself — files are
identical to the previous commit, just with SPDX headers prepended.
The KAT confirms the vendored bytes still hash correctly.
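
The comparison step of a known-answer test can be sketched in C as below. This is an illustration under assumptions, not the contents of test/blake3_kat_test.sh: the helper names are hypothetical, and the only reference value included is the published 32-byte BLAKE3 digest of the empty input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one hex digit; returns -1 for a non-hex character. */
static int hex_nibble(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Return 1 if digest (len bytes) equals the hex string expected_hex,
 * 0 on mismatch, wrong length, or malformed hex. */
static int kat_matches(const uint8_t *digest, size_t len,
                       const char *expected_hex) {
    if (strlen(expected_hex) != 2 * len) return 0;
    for (size_t i = 0; i < len; i++) {
        int hi = hex_nibble(expected_hex[2 * i]);
        int lo = hex_nibble(expected_hex[2 * i + 1]);
        if (hi < 0 || lo < 0) return 0;
        if (digest[i] != (uint8_t)((hi << 4) | lo)) return 0;
    }
    return 1;
}

/* Published BLAKE3 digest of the empty input (32 bytes, 64 hex chars). */
static const char BLAKE3_EMPTY_HEX[] =
    "af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262";
```

A harness would hash each input with the vendored implementation, then call `kat_matches(out, 32, BLAKE3_EMPTY_HEX)` for the full-digest vector and the 20-byte equivalents for the prollyHashCompute vectors, exiting non-zero on the first mismatch so CI fails.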
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that prollyHashCompute uses BLAKE3, the ~90-line in-tree SHA-512 implementation (sha512_rotr, K512 table, sha512_transform, sha512_hash, etc.) is unreachable. clang's DCE was already dropping it from the binary; remove it from source so the next reader doesn't have to wonder which hash is in use.

Net change: -90 lines from prolly_hash.c. No behavior change: verified via the KAT (6/6 still pass) and the merge oracle (41/41).
On Linux, gcc and clang don't auto-link libm; on macOS the link succeeds without the flag. The KAT harness links prolly_hash.o, which references prollyWeibullCheck → expm1, so the link was failing on Linux CI with:

  undefined reference to 'expm1'

Add -lm. Local build is unchanged (macOS already finds it).
Sysbench-Style Benchmark (TEXT PK): Doltlite vs SQLite
Companion to the classic Sysbench-Style Benchmark. Every workload here [...]
[Result tables not preserved: In-Memory reads/writes, File-Backed reads/writes, File-Backed (autocommit) reads/writes.]
File-Backed (autocommit): Each statement runs as its own transaction; exposes per-commit [...] Reads have no commit cost; these are the same SQL files as the [...]
1000 rows, single CLI invocation per test, workload-only timing via SQL timestamps.
Performance Ceiling Check (3x): All tests within ceilings.

Sysbench-Style Benchmark (BLOB PK): Doltlite vs SQLite
Companion to the classic Sysbench-Style Benchmark. Every workload here [...]
[Result tables not preserved: In-Memory reads/writes, File-Backed reads/writes, File-Backed (autocommit) reads/writes.]
File-Backed (autocommit): Each statement runs as its own transaction; exposes per-commit [...] Reads have no commit cost; these are the same SQL files as the [...]
1000 rows, single CLI invocation per test, workload-only timing via SQL timestamps.
Performance Ceiling Check (3x): All tests within ceilings.

Sysbench-Style Benchmark (composite PK): Doltlite vs SQLite
Companion to the classic Sysbench-Style Benchmark. Every workload here [...]
[Result tables not preserved: In-Memory reads/writes, File-Backed reads/writes, File-Backed (autocommit) reads/writes.]
File-Backed (autocommit): Each statement runs as its own transaction; exposes per-commit [...] Reads have no commit cost; these are the same SQL files as the [...]
1000 rows, single CLI invocation per test, workload-only timing via SQL timestamps.
Performance Ceiling Check (3x): All tests within ceilings.

Sysbench-Style Benchmark (autocommit): Doltlite vs SQLite
Moved out of the classic benchmark job so per-commit costs report separately.
[Result tables not preserved: File-Backed (autocommit) reads/writes.]
File-Backed (autocommit): Each statement runs as its own transaction; exposes per-commit [...] Reads have no commit cost; these are the same SQL files as the [...]
10000 rows, single CLI invocation per test, workload-only timing via SQL timestamps.
The hand-crafted manifest in Guard 22 (sparse >2GiB open test) hard-coded
VERSION=10. The BLAKE3 chunk-hash change bumped CHUNK_STORE_VERSION
to 11, so the test built a v10 manifest that this branch's csReadManifest
correctly rejects with SQLITE_NOTADB ("file is not a database").
Parse CHUNK_STORE_VERSION out of src/chunk_store.h at test time and pass
it to the perl manifest writer. The test now tracks format-version bumps
without further edits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vendors the upstream 1.8.5 SIMD source set (SSE2/SSE4.1/AVX2/AVX-512 on x86, NEON on aarch64) and replaces the portable-only shim with upstream's runtime CPU-feature dispatcher. main.mk picks which SIMD .c files to compile by inspecting `$(B.cc) -dumpmachine`, so wasm32 cross-compiles still produce a portable-only binary. Followup from dolthub#667.

Microbench on Apple M-series shows NEON at 2.2x portable (770 MB/s -> 1700 MB/s for 16 KB inputs); x86 should see a similar lift via AVX-512 / AVX2.

Per-file -msse2/-msse4.1/-mavx2/-mavx512f -mavx512vl flags are scoped to the matching .c files; the rest of the tree builds at the baseline ISA. Runtime dispatch never calls a backend the CPU doesn't advertise.

Test plan
- bash test/blake3_kat_test.sh -> 6/6 KAT vectors pass on the NEON build, byte-identical to portable
- 10K-row insert + branch + merge smoke test produces correct row count and commit count
- Representative oracle suites (branch/diff/merge/log/commit_ancestors) pass with no regressions vs master

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

sha512_transform was the second-biggest CPU hotspot in the profile (after the now-fixed mutmap qsort). prollyHashCompute hashes every emitted chunk; on write workloads it's a real fraction of total CPU. Replacing with portable BLAKE3.

Local sysbench result (the actual bench harness, 11-iter median)

File-backed writes (the parity target), master vs this branch: [table not preserved]
File-backed reads avg: 1.14 → 1.11.
First PR that's moved the actual sysbench numbers in a meaningful way at the workload shape that matters. oltp_insert and oltp_write_only each shaved 0.3 off the multiplier.

Microbench numbers

4 KB input, 200K iterations: [table not preserved]
Hashing was ~30% of write CPU; replacing it with something 2.63× faster cuts that to ~12%, total work drops ~18% on a hash-bound workload (matches the 200K-row local timing).

Why portable, not SIMD

Why BLAKE3 specifically

Vendored layout

Upstream BLAKE3 1.8.5 (CC0/Apache-2.0 dual licensed; compatible with doltlite's Apache-2.0).

CHUNK_STORE_VERSION 10 → 11. Chunk content addresses change (different hash function), so commit hashes for the same logical data are not the same. Existing 0.10.x DBs fail to open with the version-mismatch log message the version-10 bump introduced.

Followups

sha512_hash/sha512_transform functions are left in place for now; clang's DCE strips them from the binary. Removal is cleanup.

Verification

- vc_oracle_merge_test: 41/41
- vc_oracle_branch_test: 30/30
- vc_oracle_diff_test: 41/41
- chunk_distribution_test: 7/7

Test plan

🤖 Generated with Claude Code