Release v1.26.0 - distributed pipeline scales to 100M · benseverndev-oss/goldenmatch

GoldenMatch 1.26.0 - distributed pipeline scales to 100M

Verified on a real 4-worker GCP Ray cluster (e2-standard-16, 64 worker CPU):
a full 100,000,000-row dedupe in 213 s wall, 20,000,000 golden records,
driver process peak 0.30 GB RSS.

The distributed pipeline (GOLDENMATCH_DISTRIBUTED_PIPELINE=2) is now
driver-collect-free end to end:

score -> per-partition local connected-components -> distributed join
-> distributed golden build -> distributed write

Nothing funnels back to a single node, which is what unlocks 100M.

Added

local_cc_assignments: connected components via a single per-partition
local Union-Find map_batches. Distributed scoring is per-partition, so a
component's edges are always co-located in one block; a local Union-Find
yields the global components with no cross-node merge, no driver collect,
no iterative graph algorithm.
End-to-end distributed pipeline: rows annotated with cluster_id via a
distributed Dataset.join (not a broadcast dict); golden records built AND
written distributed (build_golden_records_distributed + write_parquet),
never materialized on the driver.

Fixed

Cross-partition cluster-id collision: the pipeline synthesized row_id
per-partition, so ids collided across partitions and connected-components
silently merged unrelated clusters (a 50M run reported ~156K clusters vs
the true ~10M). The generator now carries a global row_id, respected
by the pipeline.

See CHANGELOG.md and scripts/bench_phase5_explicit.py.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v1.26.0 - distributed pipeline scales to 100M

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!