
goldenmatch 1.29.0


@benzsevern benzsevern released this 09 Jun 03:38
94fa6dd


Probabilistic (Fellegi-Sunter) auto-config v2 is now on by default. Setting GOLDENMATCH_FS_AUTOCONFIG_V2=0 restores the legacy field set byte-identically.
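A minimal sketch of opting out for a single child process, assuming the flag is read from the process environment of whatever runs goldenmatch (these notes do not say exactly when it is read). The helper name `with_legacy_fs_autoconfig` is hypothetical and not part of goldenmatch.

```python
import os


def with_legacy_fs_autoconfig(env=None):
    """Return a copy of `env` (default: the current environment) with
    the v2 auto-config flag set to "0".

    Hypothetical helper for illustration; only the variable name and
    the value 0 come from the release notes.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["GOLDENMATCH_FS_AUTOCONFIG_V2"] = "0"
    return new_env


# Build an environment for a child process without touching our own.
child_env = with_legacy_fs_autoconfig({"PATH": "/usr/bin"})
```

The resulting mapping can be passed as `env=child_env` to `subprocess.run` for the process that runs goldenmatch, which keeps the opt-out scoped to that one run.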

Under the shared bench_er_headtohead evaluator (pairwise F1), the probabilistic auto-config path now matches or beats Splink on every measured ER dataset where Splink produces a score:

| Dataset | goldenmatch v2 | Splink |
|---|---|---|
| historical_50k | 0.779 | 0.757 |
| febrl3 | 0.991 | 0.965 |
| synthetic_person | 0.998 | 0.996 |
| dblp_acm | 0.879 | (skips) |

Levers (probabilistic auto-config path only; the weighted and zero-config dedupe_df paths are untouched):

- admit dob/date columns as a levenshtein discriminator
- drop redundant name composites when atomic given+family exist
- additively diversify blocking onto orthogonal stable keys (date-year + postcode/zip)
- admit description/multi_name as token_sort
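To make the levers concrete, here is a standard-library sketch of the three ideas they rely on: edit distance on a date string, a token-sorted comparison key, and additive blocking, where candidate pairs are the union over several blocking keys. This is illustrative only and is not goldenmatch's implementation; the record layout and the function names (`levenshtein`, `token_sort`, `candidate_pairs`, `by_family`, `by_year_postcode`) are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def token_sort(text: str) -> str:
    """Order-insensitive key: lowercase, split on whitespace, sort tokens."""
    return " ".join(sorted(text.lower().split()))


def candidate_pairs(records, key_funcs):
    """Union of the pairs that share a block under ANY of the key functions."""
    pairs = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for rec_id, rec in records.items():
            blocks[key_func(rec)].append(rec_id)
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def by_family(rec):
    return rec["family"]


def by_year_postcode(rec):
    # Orthogonal key: birth year plus postcode, independent of the name.
    return (rec["dob"][:4], rec["postcode"])


# Hypothetical records: 1 and 2 are the same person with a misspelled name.
records = {
    1: {"family": "smith", "dob": "1980-01-02", "postcode": "AB1 2CD"},
    2: {"family": "smyth", "dob": "1980-01-03", "postcode": "AB1 2CD"},
    3: {"family": "smith", "dob": "1975-06-30", "postcode": "ZZ9 9ZZ"},
}

name_only = candidate_pairs(records, [by_family])
diversified = candidate_pairs(records, [by_family, by_year_postcode])
```

In this toy data the name-only key never proposes the pair (1, 2) because the family names differ, while the added year-plus-postcode key does. Because the blocking is additive, the pair found by the name key is kept as well.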

Note: these are pairwise F1 scores under one shared evaluator. The often-cited ~0.97 Splink figure on historical_50k is a cluster-level metric, not within-cluster pairwise F1, so it is not directly comparable to the scores above.
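For readers unfamiliar with the metric, the sketch below shows one common definition of pairwise F1: every pair of records placed in the same cluster counts as a predicted match, and precision and recall are computed over those pairs. It is an assumption that bench_er_headtohead follows this definition in detail; the function names and the toy clusters are illustrative only.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """All unordered record pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_f1(predicted, truth):
    """F1 over within-cluster pairs; 0.0 when either side has no pairs."""
    pred = cluster_pairs(predicted)
    gold = cluster_pairs(truth)
    if not pred or not gold:
        return 0.0
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: the truth groups a, b, c together; the prediction splits c off
# and wrongly merges it with d.
truth = [{"a", "b", "c"}, {"d"}]
predicted = [{"a", "b"}, {"c", "d"}]
score = pairwise_f1(predicted, truth)
```

Here the truth has three pairs and the prediction has two, of which one is correct, giving a precision of 1/2 and a recall of 1/3. A cluster-level metric would score the same prediction on a different basis, which is why the two kinds of figures should not be compared directly.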

Wheel and sdist are cosign-keyless signed (sigstore bundles attached) with a build-provenance attestation.