Centralize hashing through _hash_bytes and tag value types #1628
Merged
Introduce a single `_hash_bytes(data: bytes) -> str` helper that all fingerprinting paths route through, so the underlying hashing algorithm can later be changed in exactly one place. As part of unifying on a single chokepoint, the sha224 used in the sequence/mapping/set paths is folded into the md5 already used elsewhere; digest width is unchanged.

Close confirmed fingerprint collisions by tagging primitives and bytes with their type, so `1`, `"1"`, `b"1"`, `1.0` and `"1.0"` hash distinctly. Add a cross-type must-differ test.

Restructure the digest tests: literal-digest pins are kept only for version-portable types (primitives, sequences, mappings, sets, numpy) and labelled as an algorithm-portability guard; the numpy case pins its dtype so the literal is reproducible across platforms. Version-sensitive DataFrame digests are covered by relational must-differ / must-match tests instead of brittle literals.

This is a fingerprint-changing refactor: cached fingerprints from prior versions will no longer match and will be recomputed. It is the groundwork that makes vectorized DataFrame hashing and a later non-cryptographic algorithm swap minimal, isolated changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This was referenced Jun 8, 2026
Split 1 of 3 of #1619
xxhash.xxh3_128.

What this does

Cache fingerprinting (`hamilton/caching/fingerprinting.py`) maps a Python value to a `data_version` string used in cache keys. This PR is pure groundwork: it routes every path through one helper and fixes confirmed collisions, without vectorizing or changing to a non-cryptographic algorithm (those are the follow-ups). It makes the later algorithm swap a one-line change.

1. Introduce a single `_hash_bytes` chokepoint. Every implementation now ends in `_hash_bytes(data)`, which wraps the existing `_compact_hash` base64url encoding.

The container paths (`hash_sequence`, `hash_mapping`, `hash_set`) previously each instantiated their own `hashlib.sha224()` and `.update()`-looped; they now build one `bytes` buffer and hand it to `_hash_bytes`. This is the refactor that lets a later PR change the algorithm in exactly one place.

2. Unify on a single digest (md5) at that chokepoint. The previous code mixed two digests: md5 (128-bit) for primitives/bytes and sha224 (224-bit) for sequences/mappings/sets. A single chokepoint requires a single digest, so the sha224 paths are folded onto the md5 already trusted elsewhere in the module. See Why 128-bit is sufficient below.
3. Close confirmed collisions with type tags. Primitives and bytes now carry a type tag, so `1`, `"1"`, `b"1"`, `1.0`, and `"1.0"` hash distinctly.

Previously `str(1) == str("1") == "1"` collapsed int/str (and likewise float/str, bytes/str) into identical fingerprints.

Any fingerprint change invalidates existing caches exactly once (cache miss → recompute, never a wrong result), which is what makes it safe to land the unification and collision fixes together.
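The collision in step 3 and the effect of a type tag can be reproduced with a minimal sketch. The `untagged` and `tagged` helpers and the `typename:` tag format are hypothetical and chosen only for illustration; the tag layout actually used in the PR may differ.

```python
import hashlib


def _digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def untagged(value) -> str:
    # Hypothetical pre-fix behaviour: only the textual form is hashed,
    # so the value's type is lost.
    raw = value if isinstance(value, bytes) else str(value).encode()
    return _digest(raw)


def tagged(value) -> str:
    # Hypothetical fix: prefix the payload with the type name and a separator.
    raw = value if isinstance(value, bytes) else str(value).encode()
    tag = type(value).__name__.encode()
    return _digest(tag + b":" + raw)


values = [1, "1", b"1", 1.0, "1.0"]

# Without tags, 1 / "1" / b"1" share one fingerprint and 1.0 / "1.0" another.
assert len({untagged(v) for v in values}) == 2
# With tags, all five values hash distinctly.
assert len({tagged(v) for v in values}) == 5
```

The final assertion has the same shape as a cross-type must-differ test: it pins the relationship between fingerprints and no literal digest.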
Why 128-bit is sufficient (folding sha224 → md5)
Replacing the wider sha224 with a 128-bit digest is safe here:
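For a sense of scale, a generic birthday-bound estimate can be computed directly. This is a standard back-of-the-envelope calculation and is not taken from the PR. It assumes digests behave like uniformly random 128-bit values and covers accidental collisions only, not adversarially constructed inputs. The function name and the figure of one billion cached values are illustrative assumptions.

```python
from fractions import Fraction


def collision_probability_bound(n_values: int, digest_bits: int) -> Fraction:
    # Union (birthday) bound for an ideal hash:
    # P(any collision) <= C(n, 2) / 2**bits.
    pairs = n_values * (n_values - 1) // 2
    return Fraction(pairs, 2 ** digest_bits)


# Illustrative workload: one billion distinct fingerprinted values.
bound = collision_probability_bound(10**9, 128)
print(float(bound))  # on the order of 1e-21
```

Under these assumptions, even a very large cache stays far below any practical collision risk at 128 bits.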
This PR keeps md5 (still cryptographic), so it does not claim a throughput win. The point is solely to collapse to one digest so the algorithm becomes swappable in one line. The non-cryptographic, faster `xxh3_128` swap is the third PR.

Testing

- Cross-type must-differ test (`1`, `"1"`, `b"1"`, `1.0`, `"1.0"` all distinct).
- … `pytest.importorskip` (the polars wheel crashes on hosts lacking certain CPU features), so they run on CI rather than in every local environment.

Checklist