v0.3.0 -- LLM Scoring, Fellegi-Sunter, Plugins, Connectors, Streaming, Graph ER
What's New
LLM Scorer with Budget Controls
- GPT-4o-mini scores borderline pairs, raising Abt-Buy product-matching F1 from 44.5% to 66.3% (precision 35.5% -> 95.4%, recall 59.4% -> 50.9%) for $0.04
- Budget caps (`max_cost_usd`, `max_calls`), model tiering, graceful degradation
- Three-tier: auto-accept (>0.95), LLM judge (0.75-0.95), auto-reject (<0.75)
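The three-tier routing and budget caps can be sketched roughly as follows. This is a minimal illustration, not GoldenMatch's implementation: the `Budget` class, `route_pair` function, the per-call cost, and the `"fallback"` result used for graceful degradation are all hypothetical; only the thresholds and the `max_cost_usd` / `max_calls` names come from the notes above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    """Hypothetical tracker for the max_cost_usd / max_calls caps."""
    max_cost_usd: float
    max_calls: int
    spent_usd: float = 0.0
    calls: int = 0

    def can_spend(self, cost_usd: float) -> bool:
        # Both caps must hold before another LLM call is allowed.
        return (self.calls < self.max_calls
                and self.spent_usd + cost_usd <= self.max_cost_usd)

    def record(self, cost_usd: float) -> None:
        self.calls += 1
        self.spent_usd += cost_usd


def route_pair(score: float, budget: Budget, llm_judge: Callable[[], bool],
               cost_per_call_usd: float = 0.0001) -> str:
    """Route one candidate pair through the three tiers."""
    if score > 0.95:
        return "accept"            # auto-accept, no LLM call
    if score < 0.75:
        return "reject"            # auto-reject, no LLM call
    if not budget.can_spend(cost_per_call_usd):
        # Assumed degradation behaviour: skip the LLM once the budget is spent.
        return "fallback"
    budget.record(cost_per_call_usd)
    return "accept" if llm_judge() else "reject"


budget = Budget(max_cost_usd=1.00, max_calls=1)
print(route_pair(0.97, budget, lambda: False))  # high score, LLM never consulted
print(route_pair(0.80, budget, lambda: True))   # borderline, LLM decides
print(route_pair(0.80, budget, lambda: True))   # call cap reached
```

Only the borderline band spends budget, which is why the reported cost stays small relative to the number of pairs scored.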
Fellegi-Sunter Probabilistic Model
- EM-trained m/u probabilities with Splink-style training (fix u from random pairs, train only m)
- Comparison vectors with 2/3/N levels, automatic threshold estimation
- 98.8% precision on DBLP-ACM (57.6% recall) -- opt-in for high-precision use cases
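For readers new to Fellegi-Sunter, the sketch below shows how m/u probabilities turn a comparison vector into a match weight (the sum of log2(m/u) over fields). The m/u values are made up for illustration and are not trained parameters from GoldenMatch; the function name is hypothetical, and EM training itself is not shown.

```python
import math

# Illustrative m/u probabilities per comparison level (0 = disagree, 1 = agree).
# m = P(level | records match), u = P(level | records do not match).
M = {"name": [0.5, 0.5], "zip": [0.25, 0.75]}
U = {"name": [0.875, 0.125], "zip": [0.5, 0.5]}


def match_weight(vector, m=M, u=U):
    """Sum the log2 likelihood ratios for the observed level of each field."""
    return sum(math.log2(m[field][level] / u[field][level])
               for field, level in vector.items())


# Name agrees (+2 bits), zip disagrees (-1 bit).
print(match_weight({"name": 1, "zip": 0}))
```

Pairs whose total weight exceeds a threshold are declared matches; a strict threshold is one way to trade recall for the precision reported above.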
Plugin Architecture
- Extend with custom scorers, transforms, connectors, and golden strategies
- Entry-point discovery: `pip install goldenmatch-my-plugin` auto-registers
- Protocol classes: `ScorerPlugin`, `TransformPlugin`, `ConnectorPlugin`, `GoldenStrategyPlugin`
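As a rough idea of what a protocol-based scorer plugin might look like, here is a sketch using `typing.Protocol`. The `ScorerLike` protocol and its `score` method are assumptions made for this example; the real `ScorerPlugin` interface may differ, so check the project documentation before writing a plugin.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ScorerLike(Protocol):
    """Hypothetical stand-in for a scorer protocol; not the real ScorerPlugin."""

    def score(self, left: str, right: str) -> float:
        ...


class CaseInsensitiveExact:
    """Example plugin: 1.0 when the values match ignoring case and padding."""

    def score(self, left: str, right: str) -> float:
        return 1.0 if left.strip().lower() == right.strip().lower() else 0.0


plugin = CaseInsensitiveExact()
# Structural typing: no inheritance is needed to satisfy the protocol.
print(isinstance(plugin, ScorerLike))
print(plugin.score("Acme Corp", " acme corp "))
```

Because protocols are structural, a plugin package only needs to provide the right methods; it does not have to subclass anything from the host library.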
Learned Blocking
- Auto-discovers blocking predicates from a sample run
- Evaluates recall vs reduction ratio, selects best rules
- 96.9% F1 on DBLP-ACM, on par with hand-tuned static blocking
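The recall versus reduction-ratio trade-off used to rank blocking predicates can be illustrated with a small sketch. The function name, record layout, and candidate predicates below are hypothetical; the metrics are the standard definitions (recall = share of true match pairs kept, reduction ratio = share of all possible pairs avoided).

```python
from collections import defaultdict
from itertools import combinations


def evaluate_blocking(records, key_fn, true_pairs):
    """Return (recall, reduction_ratio) for one blocking predicate.

    Assumes at least two records and a non-empty set of true pairs.
    """
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)

    # Candidate pairs are all pairs that share a block key.
    candidates = set()
    for ids in blocks.values():
        candidates.update(combinations(sorted(ids), 2))

    n = len(records)
    total_pairs = n * (n - 1) // 2
    recall = len(candidates & true_pairs) / len(true_pairs)
    reduction = 1 - len(candidates) / total_pairs
    return recall, reduction


records = {
    1: {"name": "Ann Lee", "zip": "10001"},
    2: {"name": "Anne Lee", "zip": "10001"},
    3: {"name": "Bob Ray", "zip": "94105"},
    4: {"name": "Rob Ray", "zip": "94105"},
}
true_pairs = {(1, 2), (3, 4)}

for label, key_fn in [("zip", lambda r: r["zip"]),
                      ("first letter", lambda r: r["name"][0])]:
    recall, reduction = evaluate_blocking(records, key_fn, true_pairs)
    print(f"{label}: recall={recall:.2f} reduction={reduction:.2f}")
```

Here blocking on zip keeps both true pairs, while blocking on the first letter of the name prunes more pairs but loses the Bob/Rob match.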
Enterprise Connectors
- Snowflake, Databricks, BigQuery, HubSpot, Salesforce
- Optional deps: `pip install goldenmatch[snowflake]`
- Credentials via environment variables
Explainability
- Template-based natural language explanations (zero LLM cost)
- Per-pair: "Matched because names are phonetically identical, zip codes match exactly"
- Per-cluster: summaries with bottleneck identification
- Streaming lineage output (no 10K pair cap)
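Template-based explanations need no model call: each piece of match evidence maps to a fixed phrase. The sketch below is one possible shape, with a hypothetical `TEMPLATES` table and `explain_pair` function; it is not the library's actual template set.

```python
# Hypothetical mapping from (field, comparison type) to a phrase.
TEMPLATES = {
    ("name", "phonetic"): "names are phonetically identical",
    ("zip", "exact"): "zip codes match exactly",
    ("email", "exact"): "email addresses match exactly",
}


def explain_pair(evidence):
    """Build a per-pair explanation from a list of (field, comparison) tuples."""
    reasons = [TEMPLATES[item] for item in evidence if item in TEMPLATES]
    if not reasons:
        return "No matching evidence recorded"
    return "Matched because " + ", ".join(reasons)


print(explain_pair([("name", "phonetic"), ("zip", "exact")]))
# Matched because names are phonetically identical, zip codes match exactly
```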
DuckDB Backend
- User-maintained DuckDB for out-of-core processing
- `read_table()`, `write_table()`, `list_tables()`
- `pip install goldenmatch[duckdb]`
Streaming / CDC Mode
- `StreamProcessor` for incremental record matching
- Immediate (per-record) or micro-batch modes
- Uses `match_one` -> `add_to_cluster` for live updates
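The per-record flow (match the incoming record, then attach it to a cluster) can be pictured with a toy example. The function names echo `match_one` and `add_to_cluster` from the notes, but the signatures, the list-of-lists cluster store, and the `difflib` similarity are assumptions made for this sketch, not the `StreamProcessor` API.

```python
from difflib import SequenceMatcher


def match_one(record, clusters, threshold=0.85):
    """Return the index of the best-matching cluster, or None if none qualifies."""
    best_idx, best_score = None, threshold
    for idx, members in enumerate(clusters):
        score = max(SequenceMatcher(None, record, m).ratio() for m in members)
        if score >= best_score:
            best_idx, best_score = idx, score
    return best_idx


def add_to_cluster(record, clusters, idx):
    """Append to the matched cluster, or start a new one when idx is None."""
    if idx is None:
        clusters.append([record])
    else:
        clusters[idx].append(record)


clusters = []
for incoming in ["Jon Smith", "John Smith", "Alice Wong"]:
    add_to_cluster(incoming, clusters, match_one(incoming, clusters))

print(clusters)
```

A micro-batch mode would buffer several incoming records and run the same two steps once per batch instead of once per record.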
Multi-Table Graph ER
- Match within entity types, propagate evidence across relationships
- Iterative convergence with configurable propagation modes
- "If customer A's orders match customer B's orders, boost the A-B customer score"
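One way to picture evidence propagation is an additive boost iterated to a fixed point: matching orders raise the customer pair score, and the customer score in turn raises the order pair scores. The function, the additive rule, and the `weight` parameter below are illustrative assumptions, not one of GoldenMatch's propagation modes.

```python
def propagate_scores(base_customer, base_order, owner,
                     weight=0.2, tol=1e-6, max_iters=50):
    """Iterate customer <-> order score boosts until the scores stop changing."""
    customer = dict(base_customer)
    order = dict(base_order)
    for _ in range(max_iters):
        delta = 0.0
        # Orders -> customers: average score of order pairs linking the two customers.
        for pair, base in base_customer.items():
            linked = [s for (x, y), s in order.items()
                      if (owner[x], owner[y]) == pair]
            evidence = sum(linked) / len(linked) if linked else 0.0
            new = min(1.0, base + weight * evidence)
            delta = max(delta, abs(new - customer[pair]))
            customer[pair] = new
        # Customers -> orders: the owning customers' score boosts each order pair.
        for (x, y), base in base_order.items():
            parent = customer.get((owner[x], owner[y]), 0.0)
            new = min(1.0, base + weight * parent)
            delta = max(delta, abs(new - order[(x, y)]))
            order[(x, y)] = new
        if delta < tol:
            break
    return customer, order


customer, order = propagate_scores(
    base_customer={("A", "B"): 0.7},
    base_order={("o1", "o2"): 0.9},
    owner={"o1": "A", "o2": "B"},
)
print(customer, order)
```

In this toy run the A-B customer score rises from 0.7 to about 0.9 because the linked orders match strongly; scores are capped at 1.0 so the iteration settles after a few passes.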
Benchmarks
| Dataset | Strategy | Precision | Recall | F1 | Cost |
|---|---|---|---|---|---|
| DBLP-ACM | Weighted fuzzy | 97.2% | 97.1% | 97.2% | $0 |
| DBLP-ACM | Fellegi-Sunter | 98.8% | 57.6% | 72.8% | $0 |
| Abt-Buy | Embedding + ANN | 35.5% | 59.4% | 44.5% | $0 |
| Abt-Buy | Embedding + ANN + LLM | 95.4% | 50.9% | 66.3% | $0.04 |
Scale: 7,823 rec/s at 100K records. 792 tests passing.
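As a sanity check on the table, F1 is the harmonic mean of precision and recall. Recomputing it from the rounded precision and recall columns gives values within about 0.1 point of the reported F1, a gap consistent with rounding of the inputs.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (dataset, strategy, precision %, recall %, reported F1 %) from the table above.
rows = [
    ("DBLP-ACM", "Weighted fuzzy", 97.2, 97.1, 97.2),
    ("DBLP-ACM", "Fellegi-Sunter", 98.8, 57.6, 72.8),
    ("Abt-Buy", "Embedding + ANN", 35.5, 59.4, 44.5),
    ("Abt-Buy", "Embedding + ANN + LLM", 95.4, 50.9, 66.3),
]

for dataset, strategy, p, r, reported in rows:
    print(f"{dataset} / {strategy}: computed {f1(p, r):.2f}, reported {reported}")
```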
Install
pip install goldenmatch
goldenmatch dedupe your_data.csv