goldenmatch 1.30.0

benzsevern released this 09 Jun 17:47
2b59828

New since 1.29.0:

  • Native PPRL bloom CLK kernel (opt-in, default off). The new goldenmatch-native
    symbol bloom_clk_batch (rayon + GIL release, 256-bit Cryptographic Longterm
    Key encoding) accelerates the PPRL bloom_filter transform. Enable it with
    GOLDENMATCH_NATIVE=1; the pure-Python path remains the reproducible default
    and the graceful fallback when the symbol is absent. Requires
    goldenmatch-native 0.1.5 (released separately). (#826)

  • Probabilistic EM training-pair sampling is now deterministic (#829).
    _sample_blocked_pairs applied a seeded shuffle to bare block indices whose
    order was itself non-deterministic (parallel / hash-bucketed construction),
    so the EM training sample (and thus the m/u weights, threshold, and
    precision/recall) varied from run to run. On one CI run, three invocations
    of the identical probabilistic path gave historical_50k pairwise F1 of
    0.805 / 0.779 / 0.643. The fix sorts blocks by their stable block_key
    before the seeded shuffle; post-fix, the three bench harnesses agree to
    within 0.002. The committed Splink head-to-head and bake-off numbers are
    now deterministic (see docs/benchmarks/2026-06-09-splink-bakeoff.md). The
    previously published dblp_acm = 0.879 was a non-deterministic lucky draw;
    the reproducible value is 0.377. Use the weighted path for bibliographic
    data (0.964 on DBLP-ACM).
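
The sort-before-shuffle fix in #829 can be illustrated with a minimal sketch. This is not the actual _sample_blocked_pairs code: the function names, the dict-of-lists block shape, and the example keys below are all hypothetical, chosen only to show why a fixed seed is not enough when the input order itself varies.

```python
import random


def sample_blocks_unsorted(blocks, k, seed):
    # Hypothetical "before" behavior: shuffle blocks in whatever order
    # they happened to be constructed. The seed fixes the permutation,
    # but not the order of the list it is applied to.
    items = list(blocks.items())
    random.Random(seed).shuffle(items)
    return [key for key, _ in items[:k]]


def sample_blocks_sorted(blocks, k, seed):
    # Hypothetical "after" behavior: sort by the stable block key first,
    # so the seeded shuffle always starts from the same ordering.
    items = sorted(blocks.items(), key=lambda kv: kv[0])
    random.Random(seed).shuffle(items)
    return [key for key, _ in items[:k]]


# Same blocks, two different construction orders (illustrative data only).
run_a = {
    "smith|1980": [1, 2],
    "jones|1975": [3, 4],
    "lee|1990": [5, 6],
    "kim|1988": [7, 8],
}
run_b = dict(reversed(list(run_a.items())))

# Sorting first makes the sample independent of construction order.
assert sample_blocks_sorted(run_a, 2, seed=42) == sample_blocks_sorted(run_b, 2, seed=42)

# Without the sort, the same seed yields a different sample here.
assert sample_blocks_unsorted(run_a, 2, seed=42) != sample_blocks_unsorted(run_b, 2, seed=42)
```

The point is that a seeded shuffle applies a fixed permutation to its input, so it is only reproducible when that input arrives in a reproducible order.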

Full changelog: packages/python/goldenmatch/CHANGELOG.md