v0.3.0 -- LLM Scoring, Fellegi-Sunter, Plugins, Connectors, Streaming, Graph ER

@benzsevern benzsevern released this 21 Mar 22:48

What's New

LLM Scorer with Budget Controls

  • GPT-4o-mini scores borderline pairs, raising Abt-Buy product-matching F1 from 44.5% to 66.3% (precision 35.5% -> 95.4%) at a cost of $0.04
  • Budget caps (max_cost_usd, max_calls), model tiering, graceful degradation
  • Three-tier routing by pair score: auto-accept (>0.95), LLM judge (0.75-0.95), auto-reject (<0.75)
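
The three-tier routing and budget caps described above can be sketched roughly as follows. This is an illustration only, not GoldenMatch's implementation: the `Budget` class, the `route` function, the per-call cost, and the fallback threshold used when the budget runs out are all assumptions. Only the 0.95 and 0.75 cutoffs and the `max_cost_usd` / `max_calls` names come from the notes.

```python
from dataclasses import dataclass

AUTO_ACCEPT = 0.95  # scores above this skip the LLM and are accepted
LLM_FLOOR = 0.75    # scores below this skip the LLM and are rejected


@dataclass
class Budget:
    """Hypothetical tracker for the max_cost_usd and max_calls caps."""
    max_cost_usd: float
    max_calls: int
    spent_usd: float = 0.0
    calls: int = 0

    def can_afford(self, cost_usd: float) -> bool:
        return (self.calls < self.max_calls
                and self.spent_usd + cost_usd <= self.max_cost_usd)

    def charge(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        self.calls += 1


def route(pair, score, budget, llm_judge,
          cost_per_call=0.0001, fallback_threshold=0.85):
    """Return "accept" or "reject" for one candidate pair."""
    if score > AUTO_ACCEPT:
        return "accept"
    if score < LLM_FLOOR:
        return "reject"
    if not budget.can_afford(cost_per_call):
        # Assumed form of graceful degradation: fall back to a plain
        # threshold once the budget is exhausted.
        return "accept" if score >= fallback_threshold else "reject"
    budget.charge(cost_per_call)
    return "accept" if llm_judge(pair) else "reject"
```

Here `llm_judge` is any callable that takes a pair and returns a boolean, so a stub can stand in for the real model during testing.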

Fellegi-Sunter Probabilistic Model

  • EM-trained m/u probabilities with Splink-style training (fix u from random pairs, train only m)
  • Comparison vectors with 2/3/N levels, automatic threshold estimation
  • 98.8% precision on DBLP-ACM -- opt-in for high-precision use cases
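
As a reminder of how m/u probabilities turn into a match score, here is a minimal sketch of the standard Fellegi-Sunter match weight: the sum over fields of log2(m/u) for the observed comparison level. The field names and probability values are made up for illustration, and `match_weight` is not a GoldenMatch function.

```python
import math


def match_weight(levels, m, u):
    """Sum log2(m/u) over fields for the observed comparison levels.

    levels: field -> observed level index (0 = disagree, 1 = agree here)
    m, u:   field -> list of probabilities, one per level
    """
    return sum(math.log2(m[f][lvl] / u[f][lvl]) for f, lvl in levels.items())


# Illustrative two-level probabilities (not trained values).
m = {"name": [0.1, 0.9], "zip": [0.2, 0.8]}    # P(level | true match)
u = {"name": [0.99, 0.01], "zip": [0.9, 0.1]}  # P(level | non-match)

agree = match_weight({"name": 1, "zip": 1}, m, u)     # strongly positive
disagree = match_weight({"name": 0, "zip": 0}, m, u)  # negative
```

A pair is then classified by comparing its total weight with a threshold; a higher threshold trades recall for precision, which fits the high-precision, lower-recall profile reported for this model.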

Plugin Architecture

  • Extend with custom scorers, transforms, connectors, and golden strategies
  • Entry-point discovery: pip install goldenmatch-my-plugin auto-registers
  • Protocol classes: ScorerPlugin, TransformPlugin, ConnectorPlugin, GoldenStrategyPlugin
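
To give a feel for what a protocol-based scorer plugin with entry-point discovery could look like, here is a hedged sketch. The `ScorerPlugin` name comes from the notes, but its members (`name`, `score`), the `goldenmatch.scorers` entry-point group, and the `discover` helper are assumptions, not the library's actual interface. The `entry_points(group=...)` call requires Python 3.10 or later.

```python
from importlib.metadata import entry_points
from typing import Protocol, runtime_checkable


@runtime_checkable
class ScorerPlugin(Protocol):
    """Assumed shape of a scorer plugin; the real protocol may differ."""
    name: str

    def score(self, a: str, b: str) -> float: ...


class ExactScorer:
    """Toy plugin: 1.0 when the normalized strings are equal, else 0.0."""
    name = "exact"

    def score(self, a: str, b: str) -> float:
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def discover(group: str = "goldenmatch.scorers"):
    """Load every plugin registered under a (hypothetical) entry-point group."""
    return {ep.name: ep.load() for ep in entry_points(group=group)}
```

A plugin package would declare its class under that group in its packaging metadata, which is what lets a plain `pip install` make it discoverable without further configuration.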

Learned Blocking

  • Auto-discovers blocking predicates from a sample run
  • Evaluates recall vs reduction ratio, selects best rules
  • 96.9% F1 on DBLP-ACM, on par with hand-tuned static blocking
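
The recall versus reduction-ratio trade-off can be illustrated with a small sketch. It scores candidate blocking predicates on a labeled sample and keeps the one with the highest reduction ratio among those that meet a recall floor. The function names, the sample records, and the selection rule are assumptions for illustration, not GoldenMatch's algorithm.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_fn):
    """Pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key_fn(rec)].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(frozenset(p) for p in combinations(ids, 2))
    return pairs


def evaluate(records, key_fn, true_pairs):
    """Return (recall, reduction_ratio) for one blocking predicate."""
    n = len(records)
    total = n * (n - 1) // 2
    cands = candidate_pairs(records, key_fn)
    recall = len(cands & true_pairs) / len(true_pairs)
    reduction = 1 - len(cands) / total
    return recall, reduction


def select_best(records, predicates, true_pairs, min_recall=0.9):
    """Highest reduction ratio among predicates meeting the recall floor."""
    scored = {name: evaluate(records, fn, true_pairs)
              for name, fn in predicates.items()}
    ok = {name: s for name, s in scored.items() if s[0] >= min_recall}
    return max(ok, key=lambda name: ok[name][1]) if ok else None


records = {
    1: {"name": "Ann Lee", "zip": "10001"},
    2: {"name": "Anne Lee", "zip": "10001"},
    3: {"name": "Bob Ray", "zip": "94105"},
    4: {"name": "Robert Ray", "zip": "94105"},
}
true_pairs = {frozenset({1, 2}), frozenset({3, 4})}
predicates = {
    "zip": lambda r: r["zip"],
    "first_letter": lambda r: r["name"][0],
}
best = select_best(records, predicates, true_pairs)
```

In this toy sample the first-letter predicate prunes more pairs but misses the Bob/Robert match, so the zip predicate is the one selected.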

Enterprise Connectors

  • Snowflake, Databricks, BigQuery, HubSpot, Salesforce
  • Optional deps: pip install goldenmatch[snowflake]
  • Credentials via environment variables
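
Reading connector credentials from the environment might look like the sketch below. The `load_credentials` helper and the `PREFIX_KEY` naming scheme are hypothetical; the notes do not say which variable names the connectors expect, so check the connector documentation for the real ones.

```python
import os


def load_credentials(prefix, keys, env=None):
    """Collect PREFIX_KEY variables, failing loudly if any are missing."""
    env = os.environ if env is None else env
    missing = [f"{prefix}_{k}" for k in keys if f"{prefix}_{k}" not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {k.lower(): env[f"{prefix}_{k}"] for k in keys}


# Example with an explicit mapping instead of the real environment.
creds = load_credentials(
    "SNOWFLAKE", ["USER", "PASSWORD"],
    env={"SNOWFLAKE_USER": "demo", "SNOWFLAKE_PASSWORD": "secret"},
)
```

Failing early with the list of missing variables keeps secrets out of config files while still giving a clear error at startup.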

Explainability

  • Template-based natural language explanations (zero LLM cost)
  • Per-pair: "Matched because names are phonetically identical, zip codes match exactly"
  • Per-cluster: summaries with bottleneck identification
  • Streaming lineage output (no 10K pair cap)
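
Template-based explanations need no model call: each piece of match evidence maps to a fixed phrase. A minimal sketch, assuming evidence arrives as (field, comparison type) tuples; the `TEMPLATES` table and `explain_pair` function are illustrative, not the library's API.

```python
# Hypothetical mapping from (field, comparison type) to a phrase.
TEMPLATES = {
    ("name", "phonetic"): "names are phonetically identical",
    ("zip", "exact"): "zip codes match exactly",
}


def explain_pair(evidence):
    """Render a per-pair explanation from a list of evidence tuples."""
    reasons = [TEMPLATES[e] for e in evidence if e in TEMPLATES]
    if not reasons:
        return "No matching evidence recorded"
    return "Matched because " + ", ".join(reasons)


text = explain_pair([("name", "phonetic"), ("zip", "exact")])
# text == "Matched because names are phonetically identical, zip codes match exactly"
```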

DuckDB Backend

  • User-maintained DuckDB for out-of-core processing
  • read_table(), write_table(), list_tables()
  • pip install goldenmatch[duckdb]

Streaming / CDC Mode

  • StreamProcessor for incremental record matching
  • Immediate (per-record) or micro-batch modes
  • Uses match_one -> add_to_cluster for live updates
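
The match_one -> add_to_cluster flow can be pictured with a toy in-memory processor. This is not the real `StreamProcessor`: the class below, its method signatures, and the string-similarity rule (Python's `difflib`) are stand-ins chosen to keep the sketch self-contained.

```python
from difflib import SequenceMatcher


class ToyStreamProcessor:
    """Illustrative stand-in for incremental matching over plain strings."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.clusters = []  # each cluster is a list of records

    def match_one(self, record):
        """Return the index of the best-matching cluster, or None."""
        best_idx, best = None, 0.0
        for i, cluster in enumerate(self.clusters):
            score = max(
                SequenceMatcher(None, record.lower(), member.lower()).ratio()
                for member in cluster
            )
            if score > best:
                best_idx, best = i, score
        return best_idx if best >= self.threshold else None

    def add_to_cluster(self, record, idx):
        """Append to an existing cluster, or start a new one if idx is None."""
        if idx is None:
            self.clusters.append([record])
            return len(self.clusters) - 1
        self.clusters[idx].append(record)
        return idx

    def process(self, record):
        """Immediate mode: match and assign one record."""
        return self.add_to_cluster(record, self.match_one(record))

    def process_batch(self, records):
        """Micro-batch mode, reduced here to a loop over the batch."""
        return [self.process(r) for r in records]


sp = ToyStreamProcessor()
assignments = sp.process_batch(["Jon Smith", "John Smith", "Alice Wong"])
```

With the default threshold, the two Smith records end up in one cluster and the third record starts its own.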

Multi-Table Graph ER

  • Match within entity types, propagate evidence across relationships
  • Iterative convergence with configurable propagation modes
  • "If customer A's orders match customer B's orders, boost the A-B customer score"
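
The quoted rule can be sketched as a single propagation round: every matched order pair whose owners are two different customers adds a fixed boost to that customer pair's score. The function name, the additive boost, and its size are assumptions; the actual propagation modes and convergence loop are not described in these notes.

```python
def boost_customer_scores(customer_scores, order_owner, matched_orders, boost=0.05):
    """One round of evidence propagation from orders to customers.

    customer_scores: frozenset({cust_a, cust_b}) -> score in [0, 1]
    order_owner:     order id -> customer id
    matched_orders:  iterable of (order_id, order_id) pairs judged to match
    """
    boosted = dict(customer_scores)
    for o1, o2 in matched_orders:
        pair = frozenset((order_owner[o1], order_owner[o2]))
        # Skip orders owned by the same customer or pairs not under review.
        if len(pair) == 2 and pair in boosted:
            boosted[pair] = min(1.0, boosted[pair] + boost)
    return boosted


scores = {frozenset({"A", "B"}): 0.70}
owners = {"o1": "A", "o2": "B", "o3": "A", "o4": "B", "o5": "A"}
matches = [("o1", "o2"), ("o3", "o4"), ("o1", "o5")]
updated = boost_customer_scores(scores, owners, matches)
```

An iterative version would repeat this step, feeding the updated customer scores back into order matching until the scores stop changing.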

Benchmarks

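| Dataset  | Strategy              | Precision | Recall | F1    | Cost  |
|----------|-----------------------|-----------|--------|-------|-------|
| DBLP-ACM | Weighted fuzzy        | 97.2%     | 97.1%  | 97.2% | $0    |
| DBLP-ACM | Fellegi-Sunter        | 98.8%     | 57.6%  | 72.8% | $0    |
| Abt-Buy  | Embedding + ANN       | 35.5%     | 59.4%  | 44.5% | $0    |
| Abt-Buy  | Embedding + ANN + LLM | 95.4%     | 50.9%  | 66.3% | $0.04 |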
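
For readers cross-checking the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded precision and recall figures; the results agree with the reported F1 values to within about 0.1 percentage point, which is consistent with rounding of the inputs.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    return 2 * precision * recall / (precision + recall)


# (dataset, strategy, precision %, recall %, reported F1 %)
rows = [
    ("DBLP-ACM", "Weighted fuzzy", 97.2, 97.1, 97.2),
    ("DBLP-ACM", "Fellegi-Sunter", 98.8, 57.6, 72.8),
    ("Abt-Buy", "Embedding + ANN", 35.5, 59.4, 44.5),
    ("Abt-Buy", "Embedding + ANN + LLM", 95.4, 50.9, 66.3),
]

# Absolute gap between recomputed and reported F1, per row.
gaps = [abs(f1(p, r) - reported) for _, _, p, r, reported in rows]
```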

Scale: 7,823 rec/s at 100K records. 792 tests passing.

Install

pip install goldenmatch
goldenmatch dedupe your_data.csv