Vectorize pandas and polars DataFrame hashing #1629
Open
Dev-iL wants to merge 1 commit into
ah sorry, I squash-merged. Can you rebase-onto?
Replace the per-row Python loops in the DataFrame fingerprinting paths with single-buffer hashing:

- pandas: hash the `hash_pandas_object(obj).values` uint64 buffer in one shot instead of round-tripping through `.to_dict()` and an ordered `hash_mapping`; fold column names + dtypes (schema) into the hash so frames with identical values but different schemas no longer collide; keep the path order-sensitive.
- polars: hash the `hash_rows().to_numpy()` buffer in one shot instead of `.to_list()` through a per-element `hash_sequence` loop.

Both paths route through the existing `_hash_bytes` chokepoint, so the algorithm is unchanged here. The DataFrame digest is deliberately not pinned to a literal (it depends on library-version-specific dtype reprs); coverage is via relational schema-collision, dtype-collision and order-sensitivity tests for both backends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
527c8aa to 045a5db
jernejfrank approved these changes Jun 8, 2026
Split 2 of 3 of #1619 (stacked on #1628)
`_hash_bytes` and tag value types #1628).
`xxhash.xxh3_128`.

What this does

Replaces the Python-level row iteration in the pandas and polars fingerprint paths with a single vectorized buffer hash, keeping everything else (the `_hash_bytes` chokepoint, the algorithm, the collision-prevention type tags) unchanged from PR1.

pandas: `hash_pandas_object(obj).to_dict()` fed through an ordered `hash_mapping` (one Python `hash_value` call per row) → `hash_pandas_object(obj).values.tobytes()` hashed in one `_hash_bytes` call. Column names and dtypes are folded into the hash so frames with identical cell values but different schemas do not collide. The old docstring claiming row order "doesn't matter" was incorrect; the new path is explicitly order-sensitive.

polars: `obj.hash_rows().to_list()` fed through `hash_sequence` (one Python `hash_value` call per element) → `obj.hash_rows().to_numpy().tobytes()` hashed in one `_hash_bytes` call. The `schema_hash + row_hash` combine from #1616/#1628 is preserved.

The change is purely mechanical: replace the Python loop with a buffer read. No algorithm change, no new dependency.
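For readers who want to see the shape of the pandas path, here is a minimal sketch of the approach described above. It is not the code from this PR. `fingerprint_frame` is a hypothetical name, the `_hash_bytes` below is a stand-in that uses `hashlib.sha256` (the project's real helper and its algorithm are not shown here), and the schema encoding via `repr` is an assumption. Only `pandas.util.hash_pandas_object`, `.values` and `.tobytes()` are taken from the description.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def _hash_bytes(data: bytes) -> str:
    # Stand-in for the project's `_hash_bytes` chokepoint (assumption:
    # the real helper may use a different algorithm and return type).
    return hashlib.sha256(data).hexdigest()


def fingerprint_frame(df: pd.DataFrame) -> str:
    # Hypothetical helper, not the PR's function name.
    # Schema: column names and dtype strings, in column order.
    schema = repr(
        [(str(name), str(dtype)) for name, dtype in df.dtypes.items()]
    ).encode()
    # One uint64 per row, computed inside pandas without a Python loop.
    row_hashes = hash_pandas_object(df).values
    # Single buffer read; byte order follows row order, so the result
    # is order-sensitive.
    return _hash_bytes(schema + b"\x00" + row_hashes.tobytes())


ints = pd.DataFrame({"a": [1, 2, 3]})
renamed = pd.DataFrame({"b": [1, 2, 3]})
floats = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
reversed_rows = ints.iloc[::-1]

same = fingerprint_frame(ints) == fingerprint_frame(ints.copy())
differs_by_name = fingerprint_frame(ints) != fingerprint_frame(renamed)
differs_by_dtype = fingerprint_frame(ints) != fingerprint_frame(floats)
differs_by_order = fingerprint_frame(ints) != fingerprint_frame(reversed_rows)
```

The four booleans mirror the relational checks this PR relies on: a copy hashes identically, while a renamed column, a changed dtype, or reordered rows each change the fingerprint. Comparing fingerprints against each other rather than against a literal digest keeps the checks stable across library versions with different dtype reprs.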
Benchmark

`benchmark_fingerprinting.py` on a 3-column DataFrame (int + float + string), warmup + best-of-3:

The speedup is not monotonic: the pandas speedup narrows at large sizes because `hash_pandas_object` dominates both paths (the vectorization removes the per-row loop that follows it, not the `hash_pandas_object` call itself). The polars speedup stays enormous because `hash_rows()` is fast and the old `.to_list()` → per-element loop was the bottleneck.

Testing

- `test_hash_pandas_different_columns_differ`: identical values under different column names must hash differently (pandas analog of the existing polars test).
- `test_hash_pandas_different_dtypes_differ` / `test_hash_polars_different_dtypes_differ`: identical values under different dtypes must hash differently.
- `test_hash_pandas_order_sensitive`: reordering rows changes the fingerprint.

Checklist