Speed up and harden cache fingerprinting: xxh3_128 + vectorized DataFrame hashing + collision fixes#1619
Dev-iL wants to merge 2 commits
Conversation
Replace the md5/sha224 hashes in the caching fingerprinting module with the non-cryptographic xxhash.xxh3_128, routed through a single shared _hash_bytes helper. xxh3_128 produces a 16-byte digest (24 base64url chars, identical width to the md5 already in use), so collision resistance is preserved while throughput on buffer-bound paths rises substantially.

Vectorize the DataFrame paths:
- pandas: hash the hash_pandas_object(obj).values uint64 buffer in one shot instead of round-tripping through .to_dict() and a per-row Python loop; fold column names + dtypes (schema) into the hash; keep the path order-sensitive and correct the misleading docstring.
- polars: hash the hash_rows().to_numpy() buffer in one shot instead of .to_list() into a per-element hash_sequence loop; keep the schema_hash + row_hash combine introduced in apache#1616.

Close confirmed fingerprint collisions by tagging primitives and bytes with their type, so 1, "1", b"1", 1.0 and "1.0" hash distinctly, and pandas frames with identical values but different column names or dtypes no longer collide.

Recompute the pinned literal-digest tests against the new algorithm and add must-differ / must-match collision tests plus a benchmark script demonstrating the pandas speedup (~14x on a 500k-row frame).

Declare xxhash>=0.8.0 as a core runtime dependency (xxh3_128 was added in 0.8.0); fingerprinting is imported eagerly via the caching adapter, so it must be a hard dependency rather than an optional extra.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
xxhash (the python-xxhash package) is a new runtime dependency licensed under BSD-2-Clause, whose terms require reproducing the copyright notice and licence text. Append it to LICENSE in the same style as the existing third-party (MIT databackend) entry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jernejfrank
left a comment
Looks good, the speedup is amazing! My one concern is whether invalidating existing caches is a breaking change for users who relied on them for caching heavy computations.
Valid concern! Several reasons why I think it's alright:
jernejfrank
left a comment
Good point on #1616, makes sense to add this now!
    "pandas",
    "typing_extensions > 4.0.0",
    "typing_inspect",
    "xxhash>=0.8.0",
@Dev-iL one thought -- should we make this optional?
I have a goal of basically having very minimal dependencies if possible
    import sys
    from collections.abc import Mapping, Sequence, Set

    import xxhash
I think we should change this so:
- we can function without xxhash
- if it's available we use that version, else we use the slower hashlib one
thoughts?
- I can't really argue if this is how you/PPMC see the future of the library.
- A precedent for your suggestion exists in pandas.
- Has this concern been raised by users? As a user, I find it unintuitive having to specify an extra to gain performance.
I guess my only argument is - can this be postponed until we carve out the -core package?
I'd prefer to keep it light -- this is the consistent pattern. But can you highlight the trade-offs/user impacts? We should have removed pandas and numpy; the fact that they're there is tech debt, so it's hard to see adding another... hamilton[caching] is fine imo
> I'd prefer to keep it light -- this is the consistent pattern.
Here's a benchmark I ran in a recent discussion with @skrawcz:
Environment: clean Python 3.14.3 venv (`uv venv`), latest versions installed fresh. Each library imported in its own fresh interpreter subprocess; timing measured with `time.perf_counter()` around the `import` statement only (interpreter startup excluded). 20 runs per library.

| Library | Version | Min (ms) | Median (ms) | Mean (ms) | Max (ms) |
|---|---|---|---|---|---|
| xxhash | 3.7.0 | 0.52 | 0.64 | 0.86 | 4.94 |
| numpy | 2.4.6 | 89.88 | 99.20 | 102.47 | 125.50 |
| polars | 1.41.2 | 130.39 | 136.18 | 139.31 | 161.25 |
| pandas | 3.0.3 | 283.24 | 299.08 | 301.85 | 334.94 |

(sorted fastest → slowest by median)
Notes
- polars caveat: this CPU lacks AVX2/FMA, so the default `polars` 1.41.2 wheel crashed on import with `SIGILL` (illegal instruction). The number above is from the `polars[rtcompat]` runtime build (same 1.41.2 version), the variant Polars recommends for older CPUs. On an AVX2-capable machine the default wheel would be used and its import time may differ slightly.
- pandas pulls in numpy (plus `python-dateutil`/`six`) as a dependency, which is why its import is roughly the cost of numpy plus pandas' own modules. The figures reflect each library's full import cost as a user would experience it, which is why they're measured one-per-process rather than sharing a warm interpreter.
- xxhash is a thin C-extension binding, hence the sub-millisecond import.
As you can see, xxhash is as light as it gets.
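For anyone wanting to reproduce this kind of measurement, here is a rough sketch of the method described above (one fresh interpreter per sample, timing only the `import` statement). The function name `time_import_ms` is hypothetical, and this is not the exact script used for the table.

```python
import statistics
import subprocess
import sys


def time_import_ms(module: str, runs: int = 20) -> list[float]:
    """Time `import <module>` in a fresh interpreter, once per run.

    Only the import statement is timed; interpreter startup is excluded
    because the clock starts inside the child process.
    """
    snippet = (
        "import time\n"
        "t0 = time.perf_counter()\n"
        f"import {module}\n"
        "print((time.perf_counter() - t0) * 1000.0)\n"
    )
    samples = []
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            check=True,
        )
        # Take the last line in case the imported module prints on import.
        samples.append(float(result.stdout.strip().splitlines()[-1]))
    return samples


if __name__ == "__main__":
    # "json" is a stdlib stand-in; substitute any installed library name.
    timings = time_import_ms("json", runs=5)
    print(f"min={min(timings):.2f} ms  median={statistics.median(timings):.2f} ms")
```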
> can you highlight the trade-offs/user impacts?
As mentioned in the PR description, the hashing time is ~15x shorter with xxhash. I'd be happy to add more benchmarks if there's a specific use case you have in mind.
> hamilton[caching] is fine imo
Stefan proposed this extra name too. One thing I don't understand is: does [caching] mean "fast caching" or "caching is possible"? If "slow caching" would be enabled without the extra, I'd say it is counterintuitive. Otherwise, are we ready to move the caching feature to an extra? I'm fine with that.
Follow-up to: #1616
What this does
Cache fingerprinting (`hamilton/caching/fingerprinting.py`) maps a Python value to a `data_version` string used in cache keys. This PR makes it faster and more correct, without weakening the collision-prevention guarantees added in #1616.

1. Swap the hash algorithm: md5/sha224 → `xxhash.xxh3_128`. All hashing now routes through a single `_hash_bytes` helper wrapping `xxhash.xxh3_128(data).digest()`, reusing the existing `_compact_hash` base64url encoding.

2. Vectorize the DataFrame paths (the real bottleneck; see benchmark):
- pandas: hash the `hash_pandas_object(obj).values` uint64 buffer in one shot instead of round-tripping through `.to_dict()` and an ordered per-row `hash_mapping`. Column names + dtypes (schema) are folded in; the path stays order-sensitive (the old docstring claiming row order "doesn't matter" was incorrect and is now fixed).
- polars: hash the `hash_rows().to_numpy()` buffer in one shot instead of `.to_list()` into a per-element `hash_sequence` loop. The `schema_hash + row_hash` combine from #1616 (Include metadata in numpy/polars cache fingerprints to prevent collisions) is preserved.

3. Close confirmed collisions. Primitives and bytes now carry a type tag, so `1`, `"1"`, `b"1"`, `1.0`, and `"1.0"` hash distinctly; pandas frames with identical values but different column names or dtypes no longer collide.

Any fingerprint change invalidates existing caches exactly once (cache miss → recompute, never a wrong result), which is what makes it safe to land the collision fixes alongside the algorithm swap.
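The type-tagging idea in item 3 can be illustrated with a small stdlib-only sketch. It assumes the untagged path hashed `str(obj)`, which is a guess about the old behaviour, and it uses `hashlib.blake2b` with a 16-byte digest as a stand-in for `xxhash.xxh3_128` so the example runs without extra packages. The bodies of `_hash_bytes` and `_compact_hash` and the two `hash_primitive_*` functions are illustrative, not the PR's code.

```python
import base64
import hashlib


def _hash_bytes(data: bytes) -> bytes:
    # Stand-in for xxhash.xxh3_128(data).digest(): any 16-byte digest works here.
    return hashlib.blake2b(data, digest_size=16).digest()


def _compact_hash(digest: bytes) -> str:
    # 16 bytes -> 24 base64url characters (including padding).
    return base64.urlsafe_b64encode(digest).decode()


def hash_primitive_untagged(obj) -> str:
    # Assumed old behaviour: only the string form is hashed, so 1 and "1" collide.
    return _compact_hash(_hash_bytes(str(obj).encode()))


def hash_primitive_tagged(obj) -> str:
    # Prefix the payload with the type name so equal-looking values differ.
    tag = type(obj).__name__.encode()
    payload = obj if isinstance(obj, bytes) else str(obj).encode()
    return _compact_hash(_hash_bytes(tag + b":" + payload))


if __name__ == "__main__":
    values = [1, "1", b"1", 1.0, "1.0"]
    print(hash_primitive_untagged(1) == hash_primitive_untagged("1"))
    print(len({hash_primitive_tagged(v) for v in values}))
```

In this sketch the tag is the type's `__name__`; a real implementation may prefer a fixed tag per supported type so that renaming or subclassing a type does not silently change fingerprints.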
Benchmark results
`scripts/benchmark_fingerprinting.py` fingerprints a 500,000-row, 3-column DataFrame, comparing the old per-row approach against the new vectorized path (best of 3 runs):

[Results table: old path (`to_dict()` loop) vs. new vectorized path; values not recoverable from this extract.]

The structural "no per-row Python loop" assertion is the hard correctness gate; the benchmark is corroborating evidence with a generous ≥5× floor to avoid timing flakiness.
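To make the two approaches being compared concrete, here is a hedged sketch, not the benchmark script itself. It assumes pandas is installed, uses `hashlib.blake2b` (16-byte digest) as a stand-in for `xxhash.xxh3_128`, and the function names `fingerprint_per_row` and `fingerprint_vectorized` are hypothetical.

```python
import hashlib

import pandas as pd


def _hash_bytes(data: bytes) -> bytes:
    # Stand-in for xxhash.xxh3_128(data).digest().
    return hashlib.blake2b(data, digest_size=16).digest()


def fingerprint_per_row(df: pd.DataFrame) -> str:
    """Old-style path (illustrative): materialise rows and loop in Python."""
    hasher = hashlib.blake2b(digest_size=16)
    for row in df.to_dict(orient="records"):
        for key, value in row.items():
            hasher.update(repr((key, value)).encode())
    return hasher.hexdigest()


def fingerprint_vectorized(df: pd.DataFrame) -> str:
    """New-style path (illustrative): hash one contiguous uint64 buffer."""
    # One uint64 per row, computed inside pandas without a Python-level loop.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    # Row hashes do not cover column names or dtypes, so fold the schema in.
    schema = repr([(str(name), str(dtype)) for name, dtype in df.dtypes.items()])
    schema_hash = _hash_bytes(schema.encode())
    return _hash_bytes(schema_hash + row_hashes.tobytes()).hex()


if __name__ == "__main__":
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
    renamed = df.rename(columns={"a": "x"})
    print(fingerprint_vectorized(df) == fingerprint_vectorized(df.copy()))
    print(fingerprint_vectorized(df) == fingerprint_vectorized(renamed))
```

Because the row hashes are concatenated in row order, the vectorized fingerprint in this sketch is order-sensitive, matching the behaviour the PR describes.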
Why xxh3_128 is a sound replacement for the longer sha224
The previous code mixed two digests: md5 (128-bit) for primitives/bytes and sha224 (224-bit) for sequences, mappings, and sets. Replacing the wider sha224 with a 128-bit digest is safe here for three reasons:
- `xxh3_128` is purpose-built for this. It is a fast, non-cryptographic hash with strong dispersion (passes the SMHasher quality suite), and at 128 bits it matches the width md5 was already trusted for in the same module. The swap therefore keeps the former md5 paths at par and keeps the former sha224 paths comfortably collision-safe, while removing the cryptographic-hashing overhead we were paying for no benefit.

Net: we trade unused cryptographic headroom for a large throughput win, with collision safety that remains far beyond what any cache will ever exercise.
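As a rough sanity check on the 128-bit claim, the standard birthday-bound approximation can be evaluated directly. This assumes hash outputs are uniformly distributed, which is an idealisation for any real hash function; the function name is hypothetical.

```python
import math


def collision_probability(n_items: int, bits: int = 128) -> float:
    """Birthday-bound estimate: p ≈ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # Even a trillion distinct cached values gives a probability
    # on the order of 1e-15 under this idealised model.
    print(collision_probability(10**12))
```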
Dependency & licensing
- Adds `xxhash>=0.8.0` to core runtime dependencies (`xxh3_128` was introduced in 0.8.0). Fingerprinting is imported eagerly via the caching adapter, so this is a hard dependency, not an optional extra.
- xxhash (the `python-xxhash` package) is BSD-2-Clause; its copyright and licence text are appended to `LICENSE` in the same style as the existing third-party (MIT databackend) entry.

Testing
- ... `list == tuple` sequence equality).
- ... `pytest.importorskip`.

Checklist