Centralize hashing through `_hash_bytes` and tag value types by Dev-iL · Pull Request #1628 · apache/hamilton

Dev-iL · 2026-06-08T09:26:23Z

Split 1 of 3 of #1619

1. This PR: centralize hashing through a single chokepoint and close type collisions (no algorithm or performance change beyond unifying the digest).
2. Vectorize the pandas/polars DataFrame paths (#1629).
3. Swap the algorithm to xxhash.xxh3_128 (#1630).

What this does

Cache fingerprinting (hamilton/caching/fingerprinting.py) maps a Python value to a data_version string used in cache keys. This PR is pure groundwork: it routes every path through one helper and fixes confirmed collisions, without vectorizing or changing to a non-cryptographic algorithm (those are the follow-ups). It makes the later algorithm swap a one-line change.

1. Introduce a single _hash_bytes chokepoint. Every implementation now ends in _hash_bytes(data), which wraps the existing _compact_hash base64url encoding:

def _hash_bytes(data: bytes) -> str:
    return _compact_hash(hashlib.md5(data).digest())

The container paths (hash_sequence, hash_mapping, hash_set) previously each instantiated their own hashlib.sha224() and .update()-looped; they now build one bytes buffer and hand it to _hash_bytes. This is the refactor that lets a later PR change the algorithm in exactly one place.

2. Unify on a single digest (md5) at that chokepoint. The previous code mixed two digests: md5 (128-bit) for primitives/bytes and sha224 (224-bit) for sequences/mappings/sets. A single chokepoint requires a single digest, so the sha224 paths are folded onto the md5 already trusted elsewhere in the module. See Why 128-bit is sufficient below.

3. Close confirmed collisions with type tags. Primitives and bytes now carry a type tag, so 1, "1", b"1", 1.0, and "1.0" hash distinctly:

# hash_primitive
return _hash_bytes(f"{type(obj).__name__}:{obj}".encode())
# hash_bytes
return _hash_bytes(b"bytes:" + obj)

Previously str(1) == str("1") == "1" collapsed int/str (and likewise float/str, bytes/str) into identical fingerprints.
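The collapse and its fix can be reproduced in a few lines. This is an illustrative sketch only: `untagged` is a hypothetical reconstruction of the pre-PR behavior as described above, `tagged` mirrors the two snippets shown, and both use a hex md5 digest directly rather than the module's `_compact_hash` encoding.

```python
import hashlib


def untagged(obj) -> str:
    # Hypothetical pre-PR behavior: hash str(obj), or the raw bytes as-is.
    data = obj if isinstance(obj, bytes) else str(obj).encode()
    return hashlib.md5(data).hexdigest()


def tagged(obj) -> str:
    # Prefix the payload with a type tag before hashing.
    if isinstance(obj, bytes):
        data = b"bytes:" + obj
    else:
        data = f"{type(obj).__name__}:{obj}".encode()
    return hashlib.md5(data).hexdigest()


values = [1, "1", b"1", 1.0, "1.0"]
untagged_digests = {untagged(v) for v in values}
tagged_digests = {tagged(v) for v in values}

print(len(untagged_digests))  # 2: {1, "1", b"1"} and {1.0, "1.0"} each collapse
print(len(tagged_digests))    # 5: every value gets its own fingerprint
```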

Any fingerprint change invalidates existing caches exactly once (cache miss → recompute, never a wrong result), which is what makes it safe to land the unification and collision fixes together.

Why 128-bit is sufficient (folding sha224 → md5)

Replacing the wider sha224 with a 128-bit digest is safe here:

Collision resistance is about digest width, not cryptographic strength. For a fingerprint the only property that matters is the probability that two distinct inputs map to the same digest, governed by the birthday bound (~2^(n/2)) for a well-distributed n-bit hash. These fingerprints are never a security boundary — no adversary chooses inputs to force a collision — so sha224's extra resistance to deliberate attacks buys nothing here.
128 bits is astronomically sufficient for cache keys. The birthday bound for 128 bits is ~2^64 (≈1.8×10¹⁹) distinct values before a ~50% collision chance. A cache will never hold anywhere near that; the realistic collision probability is effectively zero.
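To put numbers on the argument above, the standard birthday approximation p ≈ 1 − exp(−k(k−1)/2^(n+1)) can be evaluated directly. The helper name `collision_probability` is hypothetical, and the model assumes digests are uniformly distributed, which is the same assumption the text makes.

```python
import math


def collision_probability(num_values: int, digest_bits: int) -> float:
    """Approximate chance of at least one collision among num_values digests."""
    exponent = num_values * (num_values - 1) / 2 ** (digest_bits + 1)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)


p_billion = collision_probability(10**9, 128)  # a billion cached values
p_bound = collision_probability(2**64, 128)    # at the 2^64 birthday bound

print(f"{p_billion:.2e}")  # about 1.5e-21
print(f"{p_bound:.2f}")    # about 0.39, the same order as the quoted ~50%
```

Even a cache holding a billion distinct values sits roughly twenty orders of magnitude below any practical collision risk under this model.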

This PR keeps md5 (still cryptographic) — it does not claim a throughput win. The point is solely to collapse to one digest so the algorithm becomes swappable in one line. The non-cryptographic, faster xxh3_128 swap is the third PR.

Testing

Pinned literal-digest tests are restructured into a portability / algorithm-stability guard: literals are kept only for version-portable types (primitives, sequences, mappings, sets, numpy — whose digest is a function of the value's representation and the algorithm alone) and recomputed against the unified md5 (run, not hand-written). The numpy case pins its dtype so the literal is reproducible across platforms.
Version-sensitive DataFrame digests (which depend on library-version-specific dtype reprs) are not pinned; they are covered by relational must-differ / must-match tests instead of brittle literals.
New must-differ test for cross-type primitives (1, "1", b"1", 1.0, "1.0" all distinct).
Full caching suite passes. Polars-dependent tests are guarded with pytest.importorskip (the polars wheel crashes on hosts lacking certain CPU features), so they run on CI rather than in every local environment.

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

Introduce a single `_hash_bytes(data: bytes) -> str` helper that all fingerprinting paths route through, so the underlying hashing algorithm can later be changed in exactly one place. As part of unifying on a single chokepoint, the sha224 used in the sequence/mapping/set paths is folded into the md5 already used elsewhere, so those paths narrow from a 224-bit to a 128-bit digest.

Close confirmed fingerprint collisions by tagging primitives and bytes with their type, so `1`, `"1"`, `b"1"`, `1.0` and `"1.0"` hash distinctly. Add a cross-type must-differ test.

Restructure the digest tests: literal-digest pins are kept only for version-portable types (primitives, sequences, mappings, sets, numpy) and labelled as an algorithm-portability guard; the numpy case pins its dtype so the literal is reproducible across platforms. Version-sensitive DataFrame digests are covered by relational must-differ / must-match tests instead of brittle literals.

This is a fingerprint-changing refactor: cached fingerprints from prior versions will no longer match and will be recomputed. It is the groundwork that makes vectorized DataFrame hashing and a later non-cryptographic algorithm swap minimal, isolated changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jernejfrank

👍

Dev-iL requested review from elijahbenizzy, jernejfrank and skrawcz June 8, 2026 09:27

This was referenced Jun 8, 2026

Vectorize pandas and polars DataFrame hashing #1629

Open

Switch the fingerprint algo to xxh3_128 #1630

Open

jernejfrank approved these changes Jun 8, 2026

jernejfrank merged commit 14c7c23 into apache:main Jun 8, 2026
6 checks passed

Dev-iL deleted the 2606/hash_type_tagging branch June 8, 2026 12:04

jernejfrank merged 1 commit into apache:main from SummitSG-LLC:2606/hash_type_tagging
