Skip to content

v1.26.0 - distributed pipeline scales to 100M

Choose a tag to compare

@benzsevern benzsevern released this 04 Jun 05:57
ac36dbd

GoldenMatch 1.26.0 - distributed pipeline scales to 100M

Verified on a real 4-worker GCP Ray cluster (e2-standard-16, 64 worker CPU):
a full 100,000,000-row dedupe in 213 s wall, 20,000,000 golden records,
driver process peak 0.30 GB RSS.

The distributed pipeline (GOLDENMATCH_DISTRIBUTED_PIPELINE=2) is now
driver-collect-free end to end:

score -> per-partition local connected-components -> distributed join
-> distributed golden build -> distributed write

Nothing funnels back to a single node, which is what unlocks 100M.

Added

  • local_cc_assignments: connected components via a single per-partition
    local Union-Find map_batches. Distributed scoring is per-partition, so a
    component's edges are always co-located in one block; a local Union-Find
    yields the global components with no cross-node merge, no driver collect,
    no iterative graph algorithm.
  • End-to-end distributed pipeline: rows annotated with cluster_id via a
    distributed Dataset.join (not a broadcast dict); golden records built AND
    written distributed (build_golden_records_distributed + write_parquet),
    never materialized on the driver.

Fixed

  • Cross-partition cluster-id collision: the pipeline synthesized row_id
    per-partition, so ids collided across partitions and connected-components
    silently merged unrelated clusters (a 50M run reported ~156K clusters vs
    the true ~10M). The generator now carries a global row_id, respected
    by the pipeline.

See CHANGELOG.md and scripts/bench_phase5_explicit.py.