v1.26.0 - distributed pipeline scales to 100M
GoldenMatch 1.26.0 - distributed pipeline scales to 100M
Verified on a real 4-worker GCP Ray cluster (e2-standard-16, 64 worker CPU):
a full 100,000,000-row dedupe in 213 s wall, 20,000,000 golden records,
driver process peak 0.30 GB RSS.
The distributed pipeline (GOLDENMATCH_DISTRIBUTED_PIPELINE=2) is now
driver-collect-free end to end:
score -> per-partition local connected-components -> distributed join
-> distributed golden build -> distributed write
Nothing funnels back to a single node, which is what unlocks 100M.
Added
- local_cc_assignments: connected components via a single per-partition
local Union-Find map_batches. Distributed scoring is per-partition, so a
component's edges are always co-located in one block; a local Union-Find
yields the global components with no cross-node merge, no driver collect,
no iterative graph algorithm. - End-to-end distributed pipeline: rows annotated with cluster_id via a
distributed Dataset.join (not a broadcast dict); golden records built AND
written distributed (build_golden_records_distributed + write_parquet),
never materialized on the driver.
Fixed
- Cross-partition cluster-id collision: the pipeline synthesized row_id
per-partition, so ids collided across partitions and connected-components
silently merged unrelated clusters (a 50M run reported ~156K clusters vs
the true ~10M). The generator now carries a global row_id, respected
by the pipeline.
See CHANGELOG.md and scripts/bench_phase5_explicit.py.