Vectorize pandas and polars DataFrame hashing by Dev-iL · Pull Request #1629 · apache/hamilton

Dev-iL · 2026-06-08T10:54:50Z

Split 2 of 3 of #1619 (stacked on #1628):

1. Centralize hashing through a single chokepoint and close type collisions (#1628, "Centralize hashing through _hash_bytes and tag value types").
2. This PR: vectorize the pandas/polars DataFrame paths.
3. Swap the algorithm to xxhash.xxh3_128 (#1630).

What this does

Replaces the Python-level row iteration in the pandas and polars fingerprint paths with a single vectorized buffer hash, keeping everything else (the _hash_bytes chokepoint, the algorithm, the collision-prevention type tags) unchanged from PR1.

pandas: before, hash_pandas_object(obj).to_dict() was fed through an ordered hash_mapping (one Python hash_value call per row); now, hash_pandas_object(obj).values.tobytes() is hashed in a single _hash_bytes call. Column names and dtypes are folded into the hash so that frames with identical cell values but different schemas do not collide. The old docstring's claim that row order "doesn't matter" was incorrect; the new path is explicitly order-sensitive.

polars: before, obj.hash_rows().to_list() was fed through hash_sequence (one Python hash_value call per element); now, obj.hash_rows().to_numpy().tobytes() is hashed in a single _hash_bytes call. The schema_hash + row_hash combination from #1616/#1628 is preserved.

Apart from folding the schema into the pandas hash, the change is mechanical: replace the Python loop with a buffer read. No algorithm change, no new dependency.
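As a rough illustration of the pandas change described above, the sketch below puts the two shapes side by side. It is not the PR's code: `per_row_fingerprint` is only a loose stand-in for the old `hash_mapping` route, `buffer_fingerprint` and its `name:dtype` schema encoding are hypothetical, and the local `_hash_bytes` assumes an md5 hex digest.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def _hash_bytes(data: bytes) -> str:
    """Stand-in for the real chokepoint; assumes an md5 hex digest."""
    return hashlib.md5(data).hexdigest()


def per_row_fingerprint(df: pd.DataFrame) -> str:
    """Loose stand-in for the old shape: one Python-level step per row."""
    hasher = hashlib.md5()
    for index, row_hash in hash_pandas_object(df).to_dict().items():
        hasher.update(f"{index}:{row_hash};".encode())
    return hasher.hexdigest()


def buffer_fingerprint(df: pd.DataFrame) -> str:
    """Hypothetical new shape: schema string plus one hash over the buffer."""
    # Column names and dtypes, so equal cell values under a different schema differ.
    schema = ",".join(f"{name}:{dtype}" for name, dtype in df.dtypes.items())
    # One uint64 hash per row (index included), computed inside pandas.
    row_hashes = hash_pandas_object(df)
    # A single hash call over the contiguous buffer; no Python-level row loop.
    return _hash_bytes(schema.encode() + b"\x00" + row_hashes.values.tobytes())


df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
print(per_row_fingerprint(df))
print(buffer_fingerprint(df))
```

The polars path has the same shape, with `obj.hash_rows().to_numpy().tobytes()` supplying the buffer.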

Benchmark

Benchmark code

# benchmark_fingerprinting.py
"""Corroborating benchmark for the DataFrame fingerprinting speedup.

Compares the new vectorized hashing in
:mod:`hamilton.caching.fingerprinting` against the old per-row approach for
both supported DataFrame backends:

- pandas: vectorized hash of the ``hash_pandas_object(...).values`` buffer vs
  ``.to_dict()`` fed through an ordered ``hash_mapping`` (Python row loop).
- polars: vectorized hash of the ``hash_rows().to_numpy()`` buffer vs
  ``hash_rows().to_list()`` fed through a per-element ``hash_sequence``.

The structural "no per-row loop" assertions in the test suite are the hard
gate; this script is corroborating evidence with a generous floor to avoid the
flakiness of an absolute-time threshold. Run directly:

    python benchmarks/benchmark_fingerprinting.py
"""

import os
import platform
import re
import time

import pandas as pd

from hamilton.caching import fingerprinting as fp

# Swept so the speedup can be read as a function of frame size. Note it is not
# monotonic: at very large sizes the shared cost both paths pay (pandas'
# hash_pandas_object) dominates, narrowing the pandas speedup.
SIZES = (500, 5_000, 50_000, 500_000, 5_000_000)


def _cpu_model() -> str:
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor() or "unknown"


def _ram_info() -> str:
    import subprocess

    total = "unknown"
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal"):
                    kb = int(re.search(r"\d+", line).group())
                    total = f"{kb / 1024 / 1024:.0f} GiB"
                    break
    except OSError:
        pass
    detail = ""
    try:
        out = subprocess.check_output(
            ["dmidecode", "-t", "memory"], text=True, stderr=subprocess.DEVNULL,
        )
        typ = re.search(r"^\s*Type:\s*(DDR\S*)", out, re.MULTILINE)
        spd = re.search(r"^\s*Configured Memory Speed:\s*(\d+\s*\S+)", out, re.MULTILINE)
        if not spd:
            spd = re.search(r"^\s*Speed:\s*(\d+\s*\S+)", out, re.MULTILINE)
        parts = [m.group(1) for m in (typ, spd) if m]
        if parts:
            detail = f" ({' '.join(parts)})"
    except (OSError, subprocess.CalledProcessError):
        pass
    return f"{total}{detail}"


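def _print_env() -> None:
    import sys

    # polars is optional (main() skips it when missing), so an absent install
    # must not abort the environment report.
    try:
        import polars as pl

        polars_version = pl.__version__
    except ImportError:
        polars_version = "not installed"

    print("== Environment ==")
    print(f"  CPU      : {_cpu_model()} ({os.cpu_count()} logical cores)")
    print(f"  RAM      : {_ram_info()}")
    print(f"  Platform : {platform.system()} {platform.release()} ({platform.machine()})")
    print(f"  Python   : {sys.version.split()[0]}")
    print(f"  pandas   : {pd.__version__}")
    print(f"  polars   : {polars_version}")
    print()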


def _old_hash_pandas_obj(obj) -> str:
    """The pre-change per-row pandas implementation, kept here for comparison."""
    from pandas.util import hash_pandas_object

    hash_per_row = hash_pandas_object(obj)
    return fp.hash_mapping(hash_per_row.to_dict(), ignore_order=False, depth=1)


def _old_hash_polars_dataframe(obj) -> str:
    """The pre-change per-element polars implementation, kept here for comparison."""
    schema_str = ",".join(f"{name}:{dtype}" for name, dtype in obj.schema.items())
    schema_hash = fp.hash_bytes(schema_str.encode())
    row_hash = fp.hash_sequence(obj.hash_rows().to_list(), depth=1)
    return fp._hash_bytes(schema_hash.encode() + row_hash.encode())


def _time(fn, obj, repeats: int = 3) -> float:
    fn(obj)  # warmup: prime caches / lazy imports before timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(obj)
        best = min(best, time.perf_counter() - start)
    return best


def _report(label: str, n_rows: int, old_fn, new_fn, obj) -> None:
    old = _time(old_fn, obj)
    new = _time(new_fn, obj)
    print(
        f"[{label} n={n_rows:>9,}] old per-row {old * 1e3:9.1f} ms  "
        f"vectorized {new * 1e3:8.1f} ms  speedup {old / new:6.1f}x"
    )


def _columns(n_rows: int) -> dict:
    return {
        "a": range(n_rows),
        "b": [float(i) for i in range(n_rows)],
        "c": [f"row-{i}" for i in range(n_rows)],
    }


def main() -> None:
    _print_env()

    try:
        import polars as pl
    except ImportError:
        pl = None
        print("[polars] not installed; skipping polars")

    for n_rows in SIZES:
        columns = _columns(n_rows)
        _report("pandas", n_rows, _old_hash_pandas_obj, fp.hash_pandas_obj, pd.DataFrame(columns))
        if pl is not None:
            _report(
                "polars", n_rows, _old_hash_polars_dataframe, fp.hash_polars_dataframe,
                pl.DataFrame(columns),
            )


if __name__ == "__main__":
    main()

Plotting code

# plot_benchmarks.py
"""Generate Plotly charts for PR2 and PR3 benchmark results.

Reads hardcoded benchmark data (not recomputed) and writes two PNG files
into the same directory as this script.

    python benchmarks/plot_benchmarks.py
"""

from pathlib import Path

import plotly.graph_objects as go

OUT_DIR = Path(__file__).parent

SIZE_LABELS = ["500", "5 K", "50 K", "500 K", "5 M"]

COLORS = {
    "pandas": "#1f77b4",
    "polars": "#ff7f0e",
    "numpy": "#2ca02c",
}

# ── PR2 data (baseline → vectorized) ────────────────────────────────────
PR2 = {
    "pandas": [4.2, 11.0, 12.5, 9.8, 6.0],
    "polars": [13.1, 78.1, 382.0, 431.0, 209.0],
}

# ── PR3 data (md5 → xxh3_128, end-to-end) ───────────────────────────────
PR3 = {
    "numpy": [1.36, 3.21, 6.45, 1.44, 1.55],
    "pandas": [1.03, 0.88, 0.98, 1.00, 1.00],
    "polars": [1.25, 1.84, 2.26, 2.26, 1.67],
}


def _build_chart(data: dict, title: str, log_y: bool = False) -> go.Figure:
    fig = go.Figure()

    for backend, speedups in data.items():
        fig.add_trace(
            go.Scatter(
                name=backend,
                x=SIZE_LABELS,
                y=speedups,
                mode="lines+markers+text",
                text=[f"{s:.1f}×" for s in speedups],
                textposition="top center",
                textfont=dict(size=12),
                line=dict(color=COLORS[backend], width=3),
                marker=dict(size=10),
            )
        )

    fig.update_layout(
        title=dict(text=title, x=0.5, y=0.95, yanchor="top"),
        legend=dict(
            yanchor="top",
            y=0.98,
            xanchor="left",
            x=0.02,
            bgcolor="rgba(255,255,255,0.8)",
        ),
        template="plotly_white",
        height=500,
        width=800,
        margin=dict(t=60, b=60),
    )
    fig.update_xaxes(title_text="Input size (rows)")
    fig.update_yaxes(title_text="Speedup (×)", type="log" if log_y else "linear")

    return fig


def main() -> None:
    pr2_fig = _build_chart(
        PR2,
        title="PR2: Vectorized hashing — speedup over per-row baseline",
        log_y=True,
    )
    pr2_path = OUT_DIR / "PR2_chart.png"
    pr2_fig.write_image(str(pr2_path), scale=2)
    print(f"Wrote {pr2_path}")

    pr3_fig = _build_chart(
        PR3,
        title="PR3: xxh3_128 vs md5 — end-to-end speedup",
    )
    pr3_path = OUT_DIR / "PR3_chart.png"
    pr3_fig.write_image(str(pr3_path), scale=2)
    print(f"Wrote {pr3_path}")


if __name__ == "__main__":
    main()

benchmark_fingerprinting.py on a 3-column DataFrame (int + float + string), warmup + best-of-3:

rows	backend	baseline (per-row)	vectorized	speedup
500	pandas	4.3 ms	1.0 ms	4.2×
500	polars	1.6 ms	0.1 ms	13.1×
5,000	pandas	33.0 ms	3.0 ms	11.0×
5,000	polars	14.2 ms	0.2 ms	78.1×
50,000	pandas	319 ms	25.5 ms	12.5×
50,000	polars	135 ms	0.4 ms	382×
500,000	pandas	3,326 ms	339 ms	9.8×
500,000	polars	1,467 ms	3.4 ms	431×
5,000,000	pandas	36,173 ms	5,989 ms	6.0×
5,000,000	polars	14,595 ms	69.7 ms	209×

CPU: Intel i7-3770 @ 3.40 GHz · 32 GiB DDR3-1600 · pandas 3.0.3 · polars 1.41.2 · Python 3.14.2

The speedup is not monotonic. For pandas it narrows at large sizes because hash_pandas_object dominates both paths: the vectorization removes the per-row loop that follows it, not the hash_pandas_object call itself. For polars it stays large because hash_rows() is fast and the old .to_list() → per-element loop was the bottleneck.
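The pandas behaviour can be checked with some quick arithmetic on the table above. The sketch below splits each baseline time into the part the vectorization removed and the part that remains, both per row. The function name is hypothetical, and the inputs are the rounded table values, so the results are approximate.

```python
# (rows, baseline_ms, vectorized_ms) for pandas, copied from the table above.
PANDAS_RESULTS = [
    (500, 4.3, 1.0),
    (5_000, 33.0, 3.0),
    (50_000, 319.0, 25.5),
    (500_000, 3_326.0, 339.0),
    (5_000_000, 36_173.0, 5_989.0),
]


def per_row_costs(results):
    """Return (rows, removed_us_per_row, remaining_us_per_row) tuples."""
    out = []
    for rows, old_ms, new_ms in results:
        removed_us = (old_ms - new_ms) * 1e3 / rows  # cost the vectorization eliminated
        remaining_us = new_ms * 1e3 / rows  # cost the vectorized path still pays
        out.append((rows, removed_us, remaining_us))
    return out


for rows, removed_us, remaining_us in per_row_costs(PANDAS_RESULTS):
    print(f"n={rows:>9,}  removed {removed_us:4.1f} us/row  remaining {remaining_us:4.2f} us/row")
```

On these numbers the removed cost stays between roughly 5.9 and 6.6 µs per row at every size, while the remaining cost per row more than doubles between 50,000 and 5,000,000 rows (about 0.5 µs to 1.2 µs). A growing denominator under a flat numerator is what pulls the ratio down.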

Testing

New: test_hash_pandas_different_columns_differ — identical values under different column names must hash differently (pandas analog of the existing polars test).
New: test_hash_pandas_different_dtypes_differ / test_hash_polars_different_dtypes_differ — identical values under different dtypes must hash differently.
New: test_hash_pandas_order_sensitive — reordering rows changes the fingerprint.
Existing pinned-digest tests and relational must-differ/must-match tests pass unchanged (the algorithm is the same md5 as PR1; only the construction path changed for DataFrames, whose digests are tested relationally, not with pinned literals).
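For readers unfamiliar with relational digest tests, the sketch below shows the general shape: each test compares two fingerprints rather than pinning a literal digest. The test names and bodies are illustrative, not the PR's actual tests, and `fingerprint` is a compact hypothetical stand-in for the function under test (md5 assumed, schema folded in).

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def fingerprint(df: pd.DataFrame) -> str:
    """Compact stand-in for the function under test; not hamilton's code."""
    schema = ",".join(f"{name}:{dtype}" for name, dtype in df.dtypes.items())
    payload = schema.encode() + b"\x00" + hash_pandas_object(df).values.tobytes()
    return hashlib.md5(payload).hexdigest()


def test_same_frame_matches():
    df = pd.DataFrame({"a": [1, 2, 3]})
    assert fingerprint(df) == fingerprint(df.copy())


def test_different_columns_differ():
    # Same cell values, different column name.
    df = pd.DataFrame({"a": [1, 2, 3]})
    assert fingerprint(df) != fingerprint(df.rename(columns={"a": "b"}))


def test_different_dtypes_differ():
    # Same integer values, different width.
    df = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64")
    assert fingerprint(df) != fingerprint(df.astype("int32"))


def test_order_sensitive():
    # Reversing the rows reverses the row-hash buffer.
    df = pd.DataFrame({"a": [1, 2, 3]})
    assert fingerprint(df) != fingerprint(df.iloc[::-1])
```

Because no digest literal appears anywhere, tests of this shape keep passing when the underlying algorithm or a library's dtype repr changes.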

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

jernejfrank · 2026-06-08T12:06:37Z

ah sorry, I squash-merged. Can you rebase-onto?

Replace the per-row Python loops in the DataFrame fingerprinting paths with single-buffer hashing:

- pandas: hash the `hash_pandas_object(obj).values` uint64 buffer in one shot instead of round-tripping through `.to_dict()` and an ordered `hash_mapping`; fold column names + dtypes (schema) into the hash so frames with identical values but different schemas no longer collide; keep the path order-sensitive.
- polars: hash the `hash_rows().to_numpy()` buffer in one shot instead of `.to_list()` through a per-element `hash_sequence` loop.

Both paths route through the existing `_hash_bytes` chokepoint, so the algorithm is unchanged here.

The DataFrame digest is deliberately not pinned to a literal (it depends on library-version-specific dtype reprs); coverage is via relational schema-collision, dtype-collision and order-sensitivity tests for both backends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Dev-iL requested review from elijahbenizzy, jernejfrank and skrawcz June 8, 2026 10:55

Dev-iL mentioned this pull request Jun 8, 2026: Switch the fingerprint algo to xxh3_128 #1630 (open, 7 tasks)

Dev-iL force-pushed the 2606/vectorized_hash branch from 527c8aa to 045a5db Compare June 8, 2026 12:07

jernejfrank approved these changes Jun 8, 2026

Dev-iL wants to merge 1 commit into apache:main from SummitSG-LLC:2606/vectorized_hash
