Centralize hashing through _hash_bytes and tag value types #1628

Merged
jernejfrank merged 1 commit into apache:main from SummitSG-LLC:2606/hash_type_tagging on Jun 8, 2026

@Dev-iL Dev-iL commented Jun 8, 2026

Collaborator

Split 1 of 3 of #1619

  1. this PR — centralize hashing through a single chokepoint + close type collisions (no algorithm or performance change beyond unifying the digest).
  2. vectorize the pandas/polars DataFrame paths.
  3. swap the algorithm to xxhash.xxh3_128.

What this does

Cache fingerprinting (hamilton/caching/fingerprinting.py) maps a Python value to a data_version string used in cache keys. This PR is pure groundwork: it routes every path through one helper and fixes confirmed collisions, without vectorizing or changing to a non-cryptographic algorithm (those are the follow-ups). It makes the later algorithm swap a one-line change.

1. Introduce a single _hash_bytes chokepoint. Every implementation now ends in _hash_bytes(data), which wraps the existing _compact_hash base64url encoding:

def _hash_bytes(data: bytes) -> str:
    return _compact_hash(hashlib.md5(data).digest())

The container paths (hash_sequence, hash_mapping, hash_set) previously each instantiated their own hashlib.sha224() and fed it through an .update() loop; they now build one bytes buffer and hand it to _hash_bytes. This is the refactor that lets a later PR change the algorithm in exactly one place.

2. Unify on a single digest (md5) at that chokepoint. The previous code mixed two digests: md5 (128-bit) for primitives/bytes and sha224 (224-bit) for sequences/mappings/sets. A single chokepoint requires a single digest, so the sha224 paths are folded onto the md5 already trusted elsewhere in the module. See Why 128-bit is sufficient below.

3. Close confirmed collisions with type tags. Primitives and bytes now carry a type tag, so 1, "1", b"1", 1.0, and "1.0" hash distinctly:

# hash_primitive
return _hash_bytes(f"{type(obj).__name__}:{obj}".encode())
# hash_bytes
return _hash_bytes(b"bytes:" + obj)

Previously, because str(1) == str("1") == "1", int and str values collapsed into identical fingerprints (and likewise float/str and bytes/str).
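
The collision and its fix can be reproduced in a few lines. This is a sketch, not the module's code: `untagged` reconstructs the pre-PR behaviour from the description above (an assumption), `tagged` mirrors the two snippets quoted in this PR, and both use `hexdigest()` instead of the module's compact encoding for readability.

```python
import hashlib


def untagged(obj) -> str:
    # Assumed pre-PR behaviour: bytes hashed raw, everything else via str().
    data = obj if isinstance(obj, bytes) else str(obj).encode()
    return hashlib.md5(data).hexdigest()


def tagged(obj) -> str:
    # Post-PR behaviour: prefix the payload with a type tag.
    if isinstance(obj, bytes):
        data = b"bytes:" + obj
    else:
        data = f"{type(obj).__name__}:{obj}".encode()
    return hashlib.md5(data).hexdigest()


values = [1, "1", b"1", 1.0, "1.0"]
print(len({untagged(v) for v in values}))  # 2 distinct digests for 5 values
print(len({tagged(v) for v in values}))    # 5 distinct digests
```

Without tags the five values reduce to two payloads (`b"1"` and `b"1.0"`); with tags each value has its own payload.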

Any fingerprint change invalidates existing caches exactly once (cache miss → recompute, never a wrong result), which is what makes it safe to land the unification and collision fixes together.

Why 128-bit is sufficient (folding sha224 → md5)

Replacing the wider sha224 with a 128-bit digest is safe here:

  • The probability of an accidental collision depends on digest width, not cryptographic strength. For a fingerprint, the only property that matters is the chance that two distinct inputs map to the same digest, which for a well-distributed n-bit hash is governed by the birthday bound (~2^(n/2)). These fingerprints are never a security boundary (no adversary chooses inputs to force a collision), so sha224's extra resistance to deliberate attacks buys nothing here.
  • 128 bits is astronomically sufficient for cache keys. The birthday bound for 128 bits is ~2^64 (≈1.8×10¹⁹) distinct values before a ~50% collision chance. A cache will never hold anywhere near that; the realistic collision probability is effectively zero.
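
To put numbers on the birthday bound above, here is a small calculation using the standard approximation p ≈ 1 − exp(−k(k−1)/2^(n+1)) for k uniformly distributed n-bit digests. The function name and the sample cache sizes are illustrative choices, not values from the PR.

```python
import math


def collision_probability(num_items: int, bits: int) -> float:
    # Birthday approximation: p ~ 1 - exp(-k(k-1) / 2^(bits+1)).
    exponent = -num_items * (num_items - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(exponent)


for k in (10**6, 10**9, 10**12):
    print(f"{k:.0e} entries -> p ~ {collision_probability(k, 128):.3e}")
```

Even at a trillion cached entries the probability stays on the order of 1e-15. Under this approximation the probability is about 0.39 at exactly 2^64 entries and reaches 0.5 near 1.18 × 2^64, the same order of magnitude as the figure quoted above.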

This PR keeps md5 (still cryptographic) — it does not claim a throughput win. The point is solely to collapse to one digest so the algorithm becomes swappable in one line. The non-cryptographic, faster xxh3_128 swap is the third PR.

Testing

  • Pinned literal-digest tests are restructured into a portability / algorithm-stability guard. Literals are kept only for version-portable types (primitives, sequences, mappings, sets, numpy), whose digest depends only on the value's representation and the algorithm. The literals were regenerated by running the unified md5 code, not written by hand. The numpy case pins its dtype so the literal is reproducible across platforms.
  • Version-sensitive DataFrame digests (which depend on library-version-specific dtype reprs) are not pinned; they are covered by relational must-differ / must-match tests instead of brittle literals.
  • New must-differ test for cross-type primitives (1, "1", b"1", 1.0, "1.0" all distinct).
  • Full caching suite passes. Polars-dependent tests are guarded with pytest.importorskip (the polars wheel crashes on hosts lacking certain CPU features), so they run on CI rather than in every local environment.
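
The relational style of test mentioned above might look like the following sketch. `fingerprint` is a hypothetical stand-in for the module's type-tagged hashing, and the helper and test names are invented for illustration; the real suite's structure may differ.

```python
import hashlib
from itertools import combinations


def fingerprint(obj) -> str:
    # Hypothetical stand-in for the module's type-tagged fingerprinting.
    if isinstance(obj, bytes):
        payload = b"bytes:" + obj
    else:
        payload = f"{type(obj).__name__}:{obj}".encode()
    return hashlib.md5(payload).hexdigest()


def assert_must_differ(values) -> None:
    # Every pair of values must produce a different fingerprint.
    for a, b in combinations(values, 2):
        assert fingerprint(a) != fingerprint(b), f"collision: {a!r} vs {b!r}"


def assert_must_match(a, b) -> None:
    # Equal values built independently must produce the same fingerprint.
    assert fingerprint(a) == fingerprint(b), f"mismatch: {a!r} vs {b!r}"


def test_cross_type_primitives_differ():
    assert_must_differ([1, "1", b"1", 1.0, "1.0"])


def test_equal_values_match():
    assert_must_match("ab" + "c", "abc")
    assert_must_match(int("42"), 42)
```

Because these tests assert relationships rather than literal digests, they keep passing when the algorithm or a library's dtype repr changes, which is the property wanted for the version-sensitive DataFrame cases.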

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Introduce a single `_hash_bytes(data: bytes) -> str` helper that all
fingerprinting paths route through, so the underlying hashing algorithm
can later be changed in exactly one place. As part of unifying on a
single chokepoint, the sha224 used in the sequence/mapping/set paths is
folded into the md5 already used elsewhere; those container digests
narrow from 224 to 128 bits, while primitives and bytes stay at 128.

Close confirmed fingerprint collisions by tagging primitives and bytes
with their type, so `1`, `"1"`, `b"1"`, `1.0` and `"1.0"` hash
distinctly. Add a cross-type must-differ test.

Restructure the digest tests: literal-digest pins are kept only for
version-portable types (primitives, sequences, mappings, sets, numpy)
and labelled as an algorithm-portability guard; the numpy case pins its
dtype so the literal is reproducible across platforms. Version-sensitive
DataFrame digests are covered by relational must-differ / must-match
tests instead of brittle literals.

This is a fingerprint-changing refactor: cached fingerprints from prior
versions will no longer match and will be recomputed. It is the
groundwork that makes vectorized DataFrame hashing and a later
non-cryptographic algorithm swap minimal, isolated changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@jernejfrank jernejfrank left a comment

Contributor

👍

@jernejfrank jernejfrank merged commit 14c7c23 into apache:main Jun 8, 2026
6 checks passed
@Dev-iL Dev-iL deleted the 2606/hash_type_tagging branch June 8, 2026 12:04