Vectorize pandas and polars DataFrame hashing #1629

Open
Dev-iL wants to merge 1 commit into apache:main from SummitSG-LLC:2606/vectorized_hash

@Dev-iL Dev-iL commented Jun 8, 2026

Collaborator

Split 2 of 3 of #1619 (stacked on #1628)

  1. centralize hashing through a single chokepoint + close type collisions (Centralize hashing through _hash_bytes and tag value types #1628).
  2. this PR — vectorize the pandas/polars DataFrame paths.
  3. swap the algorithm to xxhash.xxh3_128.

What this does

Replaces the Python-level row iteration in the pandas and polars fingerprint paths with a single vectorized buffer hash, keeping everything else (the _hash_bytes chokepoint, the algorithm, the collision-prevention type tags) unchanged from PR1.

pandas: hash_pandas_object(obj).to_dict() fed through an ordered hash_mapping (one Python hash_value call per row) → hash_pandas_object(obj).values.tobytes() hashed in one _hash_bytes call. Column names and dtypes are folded into the hash so frames with identical cell values but different schemas do not collide. The old docstring claiming row order "doesn't matter" was incorrect; the new path is explicitly order-sensitive.

polars: obj.hash_rows().to_list() fed through hash_sequence (one Python hash_value call per element) → obj.hash_rows().to_numpy().tobytes() hashed in one _hash_bytes call. The schema_hash + row_hash combine from #1616/#1628 is preserved.
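As a rough illustration of the pandas path described above, the sketch below hashes the row-hash buffer in one call and folds the schema in. It is not the PR's code: `fingerprint_frame` is a hypothetical name, the local `_hash_bytes` is a stand-in that assumes a plain md5 hex digest, and the way the schema and row digests are combined is borrowed from the polars shape rather than confirmed for pandas.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def _hash_bytes(data: bytes) -> str:
    """Stand-in for the PR's chokepoint (assumption: md5 hex digest, no type tags)."""
    return hashlib.md5(data).hexdigest()


def fingerprint_frame(df: pd.DataFrame) -> str:
    """Hypothetical helper: schema digest combined with a one-shot row-buffer digest."""
    # Fold column names and dtypes in, so equal cells under different schemas differ.
    schema = ",".join(f"{name}:{dtype}" for name, dtype in df.dtypes.items())
    schema_hash = _hash_bytes(schema.encode())
    # hash_pandas_object yields one uint64 per row; hash the whole buffer at once
    # instead of calling a Python-level hash function for every row.
    row_hash = _hash_bytes(hash_pandas_object(df).values.tobytes())
    return _hash_bytes(schema_hash.encode() + row_hash.encode())


base = pd.DataFrame({"a": [1, 2, 3]})
renamed = base.rename(columns={"a": "b"})  # same cells, different column name
as_float = base.astype("float64")          # same cells, different dtype
reversed_rows = base.iloc[::-1]            # same rows, different order

for label, frame in [("base", base), ("renamed", renamed),
                     ("as_float", as_float), ("reversed_rows", reversed_rows)]:
    print(label, fingerprint_frame(frame))
```

The four printed digests should all differ, while hashing `base.copy()` should reproduce the first one; that is the relational behaviour the new tests check, without pinning any digest to a literal.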

Aside from folding the pandas schema into the hash, the change is mechanical: replace the Python loop with a buffer read. No algorithm change, no new dependency.
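To make the "loop versus buffer read" distinction concrete without needing polars installed, the following standard-library sketch uses a plain list of integers as a stand-in for the uint64 values that `hash_rows()` would return. All function names here are hypothetical, and `_hash_bytes` again assumes a plain md5 hex digest; the schema combine mirrors the shape shown in the benchmark's `_old_hash_polars_dataframe`.

```python
import hashlib
from array import array


def _hash_bytes(data: bytes) -> str:
    """Stand-in for the PR's chokepoint (assumption: md5 hex digest)."""
    return hashlib.md5(data).hexdigest()


def per_element_digest(row_hashes: list[int]) -> str:
    """Old shape: one Python-level hash call per row, then a final combine."""
    parts = [_hash_bytes(str(h).encode()) for h in row_hashes]
    return _hash_bytes("".join(parts).encode())


def single_buffer_digest(row_hashes: list[int]) -> str:
    """New shape: pack the uint64 values contiguously and hash the buffer once."""
    return _hash_bytes(array("Q", row_hashes).tobytes())


def frame_digest(schema: dict[str, str], row_hashes: list[int]) -> str:
    """Combine a schema digest with the row digest, as in the polars path."""
    schema_str = ",".join(f"{name}:{dtype}" for name, dtype in schema.items())
    schema_hash = _hash_bytes(schema_str.encode())
    row_hash = single_buffer_digest(row_hashes)
    return _hash_bytes(schema_hash.encode() + row_hash.encode())


rows = [11, 22, 33]  # stand-in row hashes, not real hash_rows() output
print(frame_digest({"a": "Int64"}, rows))
print(frame_digest({"b": "Int64"}, rows))        # schema change alters the digest
print(frame_digest({"a": "Int64"}, rows[::-1]))  # row order alters the digest
```

The two digest functions are not expected to agree with each other; the point is only that the second one makes a single hash call regardless of row count.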

Benchmark

Benchmark code

# benchmark_fingerprinting.py
"""Corroborating benchmark for the DataFrame fingerprinting speedup.

Compares the new vectorized hashing in
:mod:`hamilton.caching.fingerprinting` against the old per-row approach for
both supported DataFrame backends:

- pandas: vectorized hash of the ``hash_pandas_object(...).values`` buffer vs
  ``.to_dict()`` fed through an ordered ``hash_mapping`` (Python row loop).
- polars: vectorized hash of the ``hash_rows().to_numpy()`` buffer vs
  ``hash_rows().to_list()`` fed through a per-element ``hash_sequence``.

The structural "no per-row loop" assertions in the test suite are the hard
gate; this script is corroborating evidence with a generous floor to avoid the
flakiness of an absolute-time threshold. Run directly:

    python benchmarks/benchmark_fingerprinting.py
"""

import os
import platform
import re
import time

import pandas as pd

from hamilton.caching import fingerprinting as fp

# Swept so the speedup can be read as a function of frame size. Note it is not
# monotonic: at very large sizes the shared cost both paths pay (pandas'
# hash_pandas_object) dominates, narrowing the pandas speedup.
SIZES = (500, 5_000, 50_000, 500_000, 5_000_000)


def _cpu_model() -> str:
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor() or "unknown"


def _ram_info() -> str:
    import subprocess

    total = "unknown"
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal"):
                    kb = int(re.search(r"\d+", line).group())
                    total = f"{kb / 1024 / 1024:.0f} GiB"
                    break
    except OSError:
        pass
    detail = ""
    try:
        out = subprocess.check_output(
            ["dmidecode", "-t", "memory"], text=True, stderr=subprocess.DEVNULL,
        )
        typ = re.search(r"^\s*Type:\s*(DDR\S*)", out, re.MULTILINE)
        spd = re.search(r"^\s*Configured Memory Speed:\s*(\d+\s*\S+)", out, re.MULTILINE)
        if not spd:
            spd = re.search(r"^\s*Speed:\s*(\d+\s*\S+)", out, re.MULTILINE)
        parts = [m.group(1) for m in (typ, spd) if m]
        if parts:
            detail = f" ({' '.join(parts)})"
    except (OSError, subprocess.CalledProcessError):
        pass
    return f"{total}{detail}"


def _print_env() -> None:
    import sys

    # polars is optional (main() skips it when missing), so do not crash here.
    try:
        import polars as pl

        polars_version = pl.__version__
    except ImportError:
        polars_version = "not installed"

    print("== Environment ==")
    print(f"  CPU      : {_cpu_model()} ({os.cpu_count()} logical cores)")
    print(f"  RAM      : {_ram_info()}")
    print(f"  Platform : {platform.system()} {platform.release()} ({platform.machine()})")
    print(f"  Python   : {sys.version.split()[0]}")
    print(f"  pandas   : {pd.__version__}")
    print(f"  polars   : {polars_version}")
    print()


def _old_hash_pandas_obj(obj) -> str:
    """The pre-change per-row pandas implementation, kept here for comparison."""
    from pandas.util import hash_pandas_object

    hash_per_row = hash_pandas_object(obj)
    return fp.hash_mapping(hash_per_row.to_dict(), ignore_order=False, depth=1)


def _old_hash_polars_dataframe(obj) -> str:
    """The pre-change per-element polars implementation, kept here for comparison."""
    schema_str = ",".join(f"{name}:{dtype}" for name, dtype in obj.schema.items())
    schema_hash = fp.hash_bytes(schema_str.encode())
    row_hash = fp.hash_sequence(obj.hash_rows().to_list(), depth=1)
    return fp._hash_bytes(schema_hash.encode() + row_hash.encode())


def _time(fn, obj, repeats: int = 3) -> float:
    fn(obj)  # warmup: prime caches / lazy imports before timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(obj)
        best = min(best, time.perf_counter() - start)
    return best


def _report(label: str, n_rows: int, old_fn, new_fn, obj) -> None:
    old = _time(old_fn, obj)
    new = _time(new_fn, obj)
    print(
        f"[{label} n={n_rows:>9,}] old per-row {old * 1e3:9.1f} ms  "
        f"vectorized {new * 1e3:8.1f} ms  speedup {old / new:6.1f}x"
    )


def _columns(n_rows: int) -> dict:
    return {
        "a": range(n_rows),
        "b": [float(i) for i in range(n_rows)],
        "c": [f"row-{i}" for i in range(n_rows)],
    }


def main() -> None:
    _print_env()

    try:
        import polars as pl
    except ImportError:
        pl = None
        print("[polars] not installed; skipping polars")

    for n_rows in SIZES:
        columns = _columns(n_rows)
        _report("pandas", n_rows, _old_hash_pandas_obj, fp.hash_pandas_obj, pd.DataFrame(columns))
        if pl is not None:
            _report(
                "polars", n_rows, _old_hash_polars_dataframe, fp.hash_polars_dataframe,
                pl.DataFrame(columns),
            )


if __name__ == "__main__":
    main()

Plotting code

# plot_benchmarks.py
"""Generate Plotly charts for PR2 and PR3 benchmark results.

Reads hardcoded benchmark data (not recomputed) and writes two PNG files
into the same directory as this script.

    python benchmarks/plot_benchmarks.py
"""

from pathlib import Path

import plotly.graph_objects as go

OUT_DIR = Path(__file__).parent

SIZE_LABELS = ["500", "5 K", "50 K", "500 K", "5 M"]

COLORS = {
    "pandas": "#1f77b4",
    "polars": "#ff7f0e",
    "numpy": "#2ca02c",
}

# ── PR2 data (baseline → vectorized) ────────────────────────────────────
PR2 = {
    "pandas": [4.2, 11.0, 12.5, 9.8, 6.0],
    "polars": [13.1, 78.1, 382.0, 431.0, 209.0],
}

# ── PR3 data (md5 → xxh3_128, end-to-end) ───────────────────────────────
PR3 = {
    "numpy": [1.36, 3.21, 6.45, 1.44, 1.55],
    "pandas": [1.03, 0.88, 0.98, 1.00, 1.00],
    "polars": [1.25, 1.84, 2.26, 2.26, 1.67],
}


def _build_chart(data: dict, title: str, log_y: bool = False) -> go.Figure:
    fig = go.Figure()

    for backend, speedups in data.items():
        fig.add_trace(
            go.Scatter(
                name=backend,
                x=SIZE_LABELS,
                y=speedups,
                mode="lines+markers+text",
                text=[f"{s:.1f}×" for s in speedups],
                textposition="top center",
                textfont=dict(size=12),
                line=dict(color=COLORS[backend], width=3),
                marker=dict(size=10),
            )
        )

    fig.update_layout(
        title=dict(text=title, x=0.5, y=0.95, yanchor="top"),
        legend=dict(
            yanchor="top",
            y=0.98,
            xanchor="left",
            x=0.02,
            bgcolor="rgba(255,255,255,0.8)",
        ),
        template="plotly_white",
        height=500,
        width=800,
        margin=dict(t=60, b=60),
    )
    fig.update_xaxes(title_text="Input size (rows)")
    fig.update_yaxes(title_text="Speedup (×)", type="log" if log_y else "linear")

    return fig


def main() -> None:
    pr2_fig = _build_chart(
        PR2,
        title="PR2: Vectorized hashing — speedup over per-row baseline",
        log_y=True,
    )
    pr2_path = OUT_DIR / "PR2_chart.png"
    pr2_fig.write_image(str(pr2_path), scale=2)
    print(f"Wrote {pr2_path}")

    pr3_fig = _build_chart(
        PR3,
        title="PR3: xxh3_128 vs md5 — end-to-end speedup",
    )
    pr3_path = OUT_DIR / "PR3_chart.png"
    pr3_fig.write_image(str(pr3_path), scale=2)
    print(f"Wrote {pr3_path}")


if __name__ == "__main__":
    main()

benchmark_fingerprinting.py on a 3-column DataFrame (int + float + string), warmup + best-of-3:

| rows | backend | baseline (per-row) | vectorized | speedup |
|---|---|---|---|---|
| 500 | pandas | 4.3 ms | 1.0 ms | 4.2× |
| 500 | polars | 1.6 ms | 0.1 ms | 13.1× |
| 5,000 | pandas | 33.0 ms | 3.0 ms | 11.0× |
| 5,000 | polars | 14.2 ms | 0.2 ms | 78.1× |
| 50,000 | pandas | 319 ms | 25.5 ms | 12.5× |
| 50,000 | polars | 135 ms | 0.4 ms | 382× |
| 500,000 | pandas | 3,326 ms | 339 ms | 9.8× |
| 500,000 | polars | 1,467 ms | 3.4 ms | 431× |
| 5,000,000 | pandas | 36,173 ms | 5,989 ms | 6.0× |
| 5,000,000 | polars | 14,595 ms | 69.7 ms | 209× |

CPU: Intel i7-3770 @ 3.40 GHz · 32 GiB DDR3-1600 · pandas 3.0.3 · polars 1.41.2 · Python 3.14.2

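[Figure: PR2_chart, speedup of vectorized hashing over the per-row baseline by input size (rows), log-scale y axis]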

The speedup is not monotonic: the pandas speedup narrows at large sizes because hash_pandas_object, which both paths call, dominates the runtime (the vectorization removes the per-row loop that follows it, not the hash_pandas_object call itself). The polars speedup stays much larger because hash_rows() is fast and the old .to_list() → per-element loop was the bottleneck.
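One way to read the non-monotonic pandas curve is to convert the reported totals into a per-row cost. The short script below does only that arithmetic on the pandas timings from the table; the helper name `per_row_us` is made up for this sketch, and the table does not separate the hash_pandas_object share from the rest, so this shows where the ratio moves rather than proving why.

```python
# (rows, baseline ms, vectorized ms), copied from the pandas rows of the table.
PANDAS_TIMINGS = [
    (500, 4.3, 1.0),
    (5_000, 33.0, 3.0),
    (50_000, 319.0, 25.5),
    (500_000, 3_326.0, 339.0),
    (5_000_000, 36_173.0, 5_989.0),
]


def per_row_us(rows: int, total_ms: float) -> float:
    """Convert a total time in milliseconds to microseconds per row."""
    return total_ms * 1_000 / rows


for rows, old_ms, new_ms in PANDAS_TIMINGS:
    print(
        f"{rows:>9,} rows  old {per_row_us(rows, old_ms):5.2f} us/row  "
        f"new {per_row_us(rows, new_ms):5.2f} us/row  "
        f"speedup {old_ms / new_ms:5.1f}x"
    )
```

On these numbers the old path costs roughly 6 to 9 µs per row at every size, while the vectorized path falls to about 0.5 µs per row at 50,000 rows and climbs back to about 1.2 µs per row at 5,000,000, which is where the speedup shrinks.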

Testing

  • New: test_hash_pandas_different_columns_differ — identical values under different column names must hash differently (pandas analog of the existing polars test).
  • New: test_hash_pandas_different_dtypes_differ / test_hash_polars_different_dtypes_differ — identical values under different dtypes must hash differently.
  • New: test_hash_pandas_order_sensitive — reordering rows changes the fingerprint.
  • Existing pinned-digest tests and relational must-differ/must-match tests pass unchanged (the algorithm is the same md5 as PR1; only the construction path changed for DataFrames, whose digests are tested relationally, not with pinned literals).

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@jernejfrank

Contributor

ah sorry, I squash-merged. Can you rebase-onto?

Replace the per-row Python loops in the DataFrame fingerprinting paths
with single-buffer hashing:

- pandas: hash the `hash_pandas_object(obj).values` uint64 buffer in one
  shot instead of round-tripping through `.to_dict()` and an ordered
  `hash_mapping`; fold column names + dtypes (schema) into the hash so
  frames with identical values but different schemas no longer collide;
  keep the path order-sensitive.
- polars: hash the `hash_rows().to_numpy()` buffer in one shot instead of
  `.to_list()` through a per-element `hash_sequence` loop.

Both paths route through the existing `_hash_bytes` chokepoint, so the
algorithm is unchanged here. The DataFrame digest is deliberately not
pinned to a literal (it depends on library-version-specific dtype reprs);
coverage is via relational schema-collision, dtype-collision and
order-sensitivity tests for both backends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Dev-iL Dev-iL force-pushed the 2606/vectorized_hash branch from 527c8aa to 045a5db Compare June 8, 2026 12:07