Add embeddings support, modify Jaccard, enrich one-to-one filtering #97

Merged
kPsarakis merged 13 commits into master from add_embeddings on May 5, 2026
@chrisk21 (Collaborator) commented May 4, 2026

Pull Request Summary

Overview

This branch adds three independent capabilities to the instance-matching path, the post-retrieval selection API, and the metric evaluation API. They can be reviewed (or, in the worst case, partially reverted) in isolation.

  1. Embedding-based variant of JaccardDistanceMatcher.
  2. Tversky-based unification of the set-similarity reduction (replacing the prior Jaccard-only path).
  3. Three named one-to-one selectors on MatcherResults with Hungarian as the new default, plumbed through the metrics API as a per-call algorithm choice.

1. Embedding-based Jaccard / Tversky distance [Closes #65]

Files: valentine/algorithms/jaccard_distance/__init__.py, valentine/algorithms/jaccard_distance/jaccard_distance.py, pyproject.toml, experiments/bench.py

What. New StringDistanceFunction.Embedding value: cosine similarity on sentence-transformer embeddings replaces the per-string distance for deciding which values "match" between two columns. The rest of the matcher (Tversky reduction, threshold, Match output) is unchanged.

Design choices, each exposed as a constructor knob:

  • embedding_model: str = "all-MiniLM-L6-v2" — small (23 MB / 384-dim), CPU-friendly default.
  • embedding_device: str | None = None — auto-picks cuda, then mps, then cpu. Pass "cpu" / "cuda" / "cuda:1" / "mps" to force.
  • embedding_batch_size: int | None = None — when unset, sentence-transformers' library default (32) is used; pass an explicit value for capable hardware.

Performance shape.

  • sentence_transformers is lazy-imported inside _load_sentence_transformer, which is lru_cache-d on (model, device). Module-level import works without the optional extra installed; calling the embedding branch without it raises a clean ImportError pointing at pip install 'valentine[embeddings]'.
  • One global encode pass per get_matches / get_matches_batch call: every unique string across every column of every table is encoded once via a single batched model.encode(...), then per-column embeddings are sliced out of the shared array. This avoids re-encoding shared columns when matching multiple table pairs.
  • get_matches_batch is overridden so multi-table calls share one encoding pass.
  • L2-normalised embeddings → per-pair cosine = a single matmul; the existing per-row top-1-vs-threshold reduction ((sims >= threshold).any(axis=...)) carries over unchanged.
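
The encode-once, slice-per-column, matmul flow described in the list above can be sketched as follows. This is a minimal illustration and not the code in jaccard_distance.py: fake_encode is a hypothetical stand-in for model.encode(...) so the example runs without sentence-transformers, and encode_columns and matched_fraction are illustrative names.

```python
import numpy as np


def fake_encode(strings: list[str], dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for model.encode(...): deterministic unit vectors."""
    rows = []
    for s in strings:
        # Seed from the characters so equal strings always get equal vectors.
        seed = sum(ord(c) * (i + 1) for i, c in enumerate(s))
        rows.append(np.random.default_rng(seed).normal(size=dim))
    emb = np.asarray(rows, dtype=np.float32)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise


def encode_columns(columns: dict[str, list[str]]) -> dict[str, np.ndarray]:
    """Encode every unique string once, then slice per-column embeddings."""
    vocab = sorted({v for values in columns.values() for v in values})
    index = {v: i for i, v in enumerate(vocab)}
    shared = fake_encode(vocab)  # the single batched encode pass
    return {
        name: shared[[index[v] for v in dict.fromkeys(values)]]
        for name, values in columns.items()
    }


def matched_fraction(a: np.ndarray, b: np.ndarray, threshold: float) -> tuple[float, float]:
    """Fraction of values on each side with at least one partner above threshold."""
    sims = a @ b.T  # cosine similarity, because all rows are unit-length
    hits = sims >= threshold
    return float(hits.any(axis=1).mean()), float(hits.any(axis=0).mean())
```

Because both columns index into the same shared array, a string that appears in several columns or tables is encoded only once, which is the saving the multi-table path relies on.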

Dependency. New embeddings extra in pyproject.toml (sentence-transformers>=2.0,<6.0). Not added to base deps.
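
A lazy, cached import of an optional dependency might look roughly like the sketch below. It is a simplified illustration under assumptions: load_optional_backend is a hypothetical helper that caches the imported module by name, whereas the PR's _load_sentence_transformer is described as caching on (model, device).

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def load_optional_backend(module_name: str, extra: str):
    """Import an optional dependency on first use and cache the module."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable hint instead of a bare ModuleNotFoundError.
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install 'valentine[{extra}]'"
        ) from exc
```

Since the import happens inside the function, importing the surrounding module never touches the optional package. lru_cache does not cache exceptions, so a later call succeeds once the extra has been installed.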

What we tried and dropped: fp16 (embedding_dtype). On both CPU and MPS the bench was 8.3× slower than fp32 because the per-pair similarity step lives in NumPy, which has no hardware-accelerated fp16 matmul; the fp16 win on encode is dwarfed by the matmul loss. Properly using fp16 would require keeping embeddings as torch tensors throughout — a bigger architectural change deferred as future work. The flag and code path were removed cleanly.

Bench impact (NYU full, default model, batch=128, threshold=0.7, MPS auto-detected):

variant                       F1     R@|GT|  MRR    wall
Jacc-Lev (existing, default)  0.646  0.616   0.702  2.6s
Jacc-Emb (new)                0.657  0.636   0.747  11.0s

Real F1 / MRR uplift over Levenshtein at ~4× the wall time. Conditionally added to the bench (JaccardDistanceMatcher_emb) only when sentence-transformers is importable.


2. Tversky as the unified set-similarity reduction

Files: valentine/algorithms/jaccard_distance/__init__.py, valentine/algorithms/jaccard_distance/jaccard_distance.py

What. The directional-match-counts → pair-similarity reduction is now an asymmetric Tversky index, max-symmetrised across the two directions:

T(A, B; α, β) = |A∩B| / (|A∩B| + α·|A−B| + β·|B−A|)
score = max(T(A, B; α, β), T(B, A; α, β))

Constructor now exposes tversky_alpha: float = 1.0 and tversky_beta: float = 1.0. Defaults give exactly Jaccard and reduce to the previous code's behaviour. Other operating points:

  • α = β = 1.0 → Jaccard (default).
  • α = β = 0.5 → Sørensen-Dice.
  • α = 1.0, β = 0.0 (or vice versa) → set containment, max(|∩|/|A|, |∩|/|B|).
  • Anything in between is a tunable Tversky.
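
The operating points above can be checked on exact sets with a few lines of Python. This is a sketch of the formula only; the function names are illustrative, and the matcher itself works on fuzzy match counts, not set intersections.

```python
def tversky(a: set, b: set, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Asymmetric Tversky index T(A, B; alpha, beta) on exact sets."""
    inter = len(a & b)
    denom = inter + alpha * len(a - b) + beta * len(b - a)
    return inter / denom if denom else 0.0


def symmetric_tversky(a: set, b: set, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Max-symmetrised score: the better of the two directions."""
    return max(tversky(a, b, alpha, beta), tversky(b, a, alpha, beta))


A, B = {1, 2, 3, 4}, {3, 4, 5}
jaccard = symmetric_tversky(A, B, 1.0, 1.0)      # |A∩B| / |A∪B| = 2/5
dice = symmetric_tversky(A, B, 0.5, 0.5)         # 2|A∩B| / (|A| + |B|) = 4/7
containment = symmetric_tversky(A, B, 1.0, 0.0)  # max(2/4, 2/3) = 2/3
```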

Why. The previous code had a binary set_similarity flag (Jaccard / Containment) that we briefly experimented with and removed. Tversky is a strict superset that subsumes both options under a single principled API and gives the user fine-grained control (e.g. asymmetric Tversky for subset/superset workloads) without adding new enums.

Implementation note. Both directional match counts (a_match, b_match) are read off the same value-level score matrix in one pass — no extra compute compared to the prior single-direction reduction. The rapidfuzz path keeps its score_cutoff=threshold optimisation; the embedding path uses the same matmul.
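
One way to read both directional counts off a single value-level score matrix is sketched below. The mapping from counts to the Tversky terms is an assumption made for illustration (matched values stand in for |A∩B|, unmatched values for the set differences); the PR text does not spell out the exact mapping, and the helper names are hypothetical.

```python
import numpy as np


def directional_counts(sims: np.ndarray, threshold: float) -> tuple[int, int]:
    """Count matched values in each direction from one score matrix."""
    hits = sims >= threshold               # computed once, reused for both directions
    a_match = int(hits.any(axis=1).sum())  # values of A with a partner in B
    b_match = int(hits.any(axis=0).sum())  # values of B with a partner in A
    return a_match, b_match


def tversky_from_counts(a_match: int, b_match: int, n_a: int, n_b: int,
                        alpha: float = 1.0, beta: float = 1.0) -> float:
    """Max-symmetrised Tversky from directional counts (assumed mapping)."""
    a_only, b_only = n_a - a_match, n_b - b_match

    def directed(inter: int, own_only: int, other_only: int) -> float:
        denom = inter + alpha * own_only + beta * other_only
        return inter / denom if denom else 0.0

    return max(directed(a_match, a_only, b_only), directed(b_match, b_only, a_only))
```

With exact string equality as the score, a_match and b_match both equal |A∩B| and the result reduces to the set formula, so alpha = beta = 1.0 gives plain Jaccard.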

What we tried and dropped: A MatchWeighting enum with Margin weighting (each value's contribution = top1 − top2 margin instead of binary). It regressed F1 / R@|GT| / MRR across the bench because it double-penalised real matches with near-tied alternatives. Removed cleanly; only the Binary path remains, inlined, with no flag.

Bench impact (NYU full): Default (α=β=1.0) is bit-identical to prior Jaccard behaviour. Symmetric Tversky variants (α=β between 0.5 and 1.0) are rank-equivalent under one-to-one selection, so they produce identical F1 / R@|GT| / MRR. Containment (α=1, β=0) regresses F1 by ~12pp on NYU because asymmetric scoring inflates similarity for size-asymmetric pairs — included for users with subset/superset workloads (dataset discovery, etc.) where it's the right tool.


3. One-to-one selection — three named methods, pluggable per call

Files: valentine/algorithms/matcher_results.py, valentine/metrics/base_metric.py, valentine/metrics/metric_helpers.py, valentine/metrics/metrics.py, tests/*, examples/*, docs/*

What. Three explicitly-named selectors on MatcherResults:

  • one_to_one_hungarian(threshold=None) — globally optimal 1:1 assignment via scipy.optimize.linear_sum_assignment. Each source and target appears in at most one returned pair; the assignment maximises total similarity over all valid 1:1 assignments. Threshold semantics match the prior one_to_one (median by default). Result is cached on _cached_hungarian. This is the new default 1:1 selector for the metrics API.
  • one_to_one_greedy(threshold=None) — the prior greedy implementation, kept under an explicit name for tests that pin specific outputs and for users who need the legacy behaviour. Not cached.
  • one_to_one_mutual_top(n=1) — keeps pair (s, t) only if t is among s's top-n targets AND s is among t's top-n sources. With n=1 this is the classic mutual nearest-neighbour filter.

The previous one_to_one() method is removed (renamed). The previous _cached_one_to_one attribute is renamed to _cached_hungarian.
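
The mutual top-n rule can be illustrated on a dense similarity matrix. This is a sketch under assumptions: MatcherResults stores matches differently, and mutual_top_n here is an illustrative function, not the library method.

```python
import numpy as np


def mutual_top_n(sims: np.ndarray, n: int = 1) -> list[tuple[int, int]]:
    """Keep (s, t) only if t is in s's top-n targets and s is in t's top-n sources."""
    top_cols = np.argsort(-sims, axis=1)[:, :n]  # best n targets per source row
    top_rows = np.argsort(-sims, axis=0)[:n, :]  # best n sources per target column
    pairs = []
    for s in range(sims.shape[0]):
        for t in top_cols[s]:
            if s in top_rows[:, t]:
                pairs.append((s, int(t)))
    return pairs


sims = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.7],
    [0.1, 0.2, 0.6],
])
pairs = mutual_top_n(sims, n=1)  # only (0, 0) is a mutual nearest neighbour
```

Source 1 prefers target 0 and source 2 prefers target 2, but neither preference is returned, so both pairs are dropped. This is why the filter tends to trade recall for precision.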

Pluggable algorithm choice in the metrics API.

The choice of 1:1 algorithm is now a per-call argument rather than hardcoded inside apply(). Specifically:

  • A new OneToOneMethod literal ("greedy" | "hungarian" | "mutual_top") lives in valentine/metrics/base_metric.py.

  • Every concrete metric's apply(matches, ground_truth, one_to_one_method="hungarian") now takes the algorithm as a parameter and dispatches via the helper _apply_one_to_one (valentine/metrics/metric_helpers.py); unknown values raise ValueError.

  • MatcherResults.get_metrics(...) accepts the same one_to_one_method parameter and threads it into every metric in the set, so users can pick the algorithm in one place:

    matches.get_metrics(gt, metrics={F1Score()}, one_to_one_method="mutual_top")
  • The default is "hungarian" end-to-end, so existing callers see no behavioural change. Metrics with one_to_one=False (e.g. MeanReciprocalRank, RecallAtSizeofGroundTruth) ignore the argument.
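
A dispatcher of the kind described above could be sketched as follows. FakeResults is a hypothetical stand-in for MatcherResults, and apply_one_to_one only illustrates the dispatch-or-raise pattern; the real helper in metric_helpers.py may differ in name and details.

```python
from typing import Literal, get_args

OneToOneMethod = Literal["greedy", "hungarian", "mutual_top"]


class FakeResults:
    """Hypothetical stand-in for MatcherResults; each selector returns a label."""

    def one_to_one_greedy(self):
        return "greedy"

    def one_to_one_hungarian(self):
        return "hungarian"

    def one_to_one_mutual_top(self):
        return "mutual_top"


def apply_one_to_one(matches, method: str = "hungarian"):
    """Dispatch to the named selector; reject unknown names early."""
    selectors = {
        "greedy": matches.one_to_one_greedy,
        "hungarian": matches.one_to_one_hungarian,
        "mutual_top": matches.one_to_one_mutual_top,
    }
    if method not in selectors:
        raise ValueError(
            f"Unknown one_to_one_method {method!r}; "
            f"expected one of {get_args(OneToOneMethod)}"
        )
    return selectors[method]()
```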

Why. Greedy bipartite matching can lock in a locally-best pair that blocks a globally-better assignment; Hungarian cannot, and scipy is already a transitive dependency, so Hungarian is free. The mutual-top-n filter fills a third operating point on the precision/recall curve. Surfacing the algorithm as a per-call argument lets users explore precision/recall tradeoffs without instantiating per-config metric variants — useful both for downstream code and for the bench harness.
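
A two-by-two example makes the greedy failure mode concrete. The sketch below uses scipy.optimize.linear_sum_assignment with maximize=True for the optimal assignment and a simple highest-score-first loop for greedy; thresholding is omitted, and both helper names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def greedy_assignment(sims: np.ndarray) -> list[tuple[int, int]]:
    """Take cells from highest to lowest score, skipping used rows and columns."""
    pairs, used_rows, used_cols = [], set(), set()
    for idx in np.argsort(-sims, axis=None):
        r, c = divmod(int(idx), sims.shape[1])
        if r not in used_rows and c not in used_cols:
            pairs.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return pairs


def hungarian_assignment(sims: np.ndarray) -> list[tuple[int, int]]:
    """Globally optimal 1:1 assignment that maximises total similarity."""
    rows, cols = linear_sum_assignment(sims, maximize=True)
    return list(zip(rows.tolist(), cols.tolist()))


sims = np.array([
    [0.90, 0.80],
    [0.85, 0.10],
])
greedy = greedy_assignment(sims)        # [(0, 0), (1, 1)], total 1.00
optimal = hungarian_assignment(sims)    # [(0, 1), (1, 0)], total 1.65
```

Greedy locks in the 0.90 cell first and is then forced to accept the 0.10 cell, while the optimal assignment gives up 0.10 on the first row to gain 0.75 on the second.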

Bench impact (per-selector F1 on three matchers, NYU full):

matcher    greedy  hungarian (default)  mutual_top
Coma_Inst  0.761   0.772                0.762
Jacc-Lev   0.635   0.646                0.696
Jacc-Emb   0.647   0.657                0.696

Hungarian is consistently +0.010 to +0.011 F1 over greedy with the same output shape. Mutual top-1 is the F1 winner on the two Jaccard matchers (+0.039 on Jacc-Emb, +0.050 on Jacc-Lev over Hungarian) at the cost of recall, but trails Hungarian by 0.010 on Coma_Inst. R@|GT| and MRR are unchanged across selectors (they are retrieval-quality metrics on the full ranked output, so selector-independent by design).

API breaks.

  • MatcherResults.one_to_one() is gone. Every caller in this repo has been migrated.
  • Metric.apply() and MatcherResults.get_metrics() gain a new keyword-only argument with a default. Existing positional callers are unaffected. Custom Metric subclasses that override apply will need to accept the new one_to_one_method keyword (or **kwargs).
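
The effect on custom metrics can be illustrated with stand-in classes. Everything below is hypothetical (the classes, the get_metrics helper, and the toy overlap metric); it only shows why an override with the old signature fails once the caller starts passing one_to_one_method, and the two ways to accept it.

```python
class LegacyMetric:
    """Old-style override: no room for the new keyword."""

    def apply(self, matches, ground_truth):
        return len(set(matches) & set(ground_truth))


class ExplicitMetric:
    """Accepts the new keyword explicitly."""

    def apply(self, matches, ground_truth, *, one_to_one_method="hungarian"):
        return len(set(matches) & set(ground_truth))


class KwargsMetric:
    """Accepts and ignores any extra keywords."""

    def apply(self, matches, ground_truth, **kwargs):
        return len(set(matches) & set(ground_truth))


def get_metrics(matches, ground_truth, metrics, one_to_one_method="hungarian"):
    """Hypothetical caller that threads the keyword into every metric."""
    return {
        type(m).__name__: m.apply(matches, ground_truth, one_to_one_method=one_to_one_method)
        for m in metrics
    }


matches = {("a", "x"): 0.9, ("b", "y"): 0.4}
ground_truth = [("a", "x"), ("c", "z")]
scores = get_metrics(matches, ground_truth, [ExplicitMetric(), KwargsMetric()])
```

Passing LegacyMetric() to the same helper raises TypeError, because its apply does not accept the one_to_one_method keyword.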

Migration footprint inside the repo:

  • valentine/metrics/metrics.py: six apply() methods now take and use one_to_one_method; hardcoded matches.one_to_one_hungarian() calls are gone.
  • valentine/metrics/metric_helpers.py: new _apply_one_to_one dispatcher.
  • tests/test_matcher_results.py, tests/test_coverage_gaps.py, tests/test_distribution_based_benchmark.py, tests/test_docs_smoke.py: renamed test methods, updated calls, updated cache-attribute references.
  • examples/valentine_example_pandas.py, valentine_example_polars.py, valentine_example_mixed.py: updated to the new default.
  • README.md, docs/api.md, docs/example.md, docs/faq.md, docs/results.md, docs/metrics.md, docs/changelog.md: prose, code samples, link anchors all updated. docs/api.md has full sections for all three new selectors.

Downstream users will need to migrate to one_to_one_hungarian() (recommended — better default) or one_to_one_greedy() (preserve previous behaviour). The rename is justified by the better default and by the project being on 1.0.0.dev0.


Tests

All 248 existing tests pass; 24 polars tests skip (extra not installed in CI); 6 doctests pass. The Tversky default (α=β=1.0) is bit-identical to the previous Jaccard implementation, and the metric API default ("hungarian") is bit-identical to the prior hardcoded path, which is why the regression suite stayed green throughout. The embedding branch is exercised by the bench harness (when sentence-transformers is available); we did not add unit tests for it — that's a reasonable follow-up.

Suggested review focus

  1. The one_to_one() rename and the Metric.apply signature change are both public-API breaks. Justified by the better default and by the major version on the horizon, but worth a deliberate sign-off.
  2. The fp16 dead-end in the embedding section is documented in PR text and code comments; no orphaned code, but worth confirming the diagnosis (NumPy has no hardware-accelerated fp16 matmul) before closing the door on it.
  3. Tversky default = Jaccard means existing Jaccard-only callers are unaffected. The new α/β knobs are purely additive. Same goes for the one_to_one_method argument — purely additive, default preserves behaviour.

…eric tversky index, add more one-to-one filtering methods
@chrisk21 chrisk21 requested a review from kPsarakis May 4, 2026 13:42
@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 94.95413% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.13%. Comparing base (14cbba1) to head (7d0e045).
⚠️ Report is 1 commits behind head on master.

Files with missing lines                               Patch %  Lines
...lgorithms/distribution_based/distribution_based.py  80.00%   2 Missing and 3 partials ⚠️
...ne/algorithms/jaccard_distance/jaccard_distance.py  97.77%   1 Missing and 1 partial ⚠️
valentine/algorithms/matcher_results.py                97.29%   1 Missing and 1 partial ⚠️
valentine/metrics/metrics.py                           85.71%   2 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##           master      #97      +/-   ##
==========================================
- Coverage   95.30%   95.13%   -0.18%     
==========================================
  Files          53       53              
  Lines        2621     2794     +173     
  Branches      399      440      +41     
==========================================
+ Hits         2498     2658     +160     
- Misses         75       83       +8     
- Partials       48       53       +5     
Flag  Coverage Δ
unit  95.13% <94.95%> (-0.18%) ⬇️

Files with missing lines                               Coverage Δ
...lentine/algorithms/distribution_based/discovery.py  100.00% <100.00%> (ø)
valentine/algorithms/jaccard_distance/__init__.py      100.00% <100.00%> (ø)
valentine/metrics/base_metric.py                       93.75% <100.00%> (+0.41%) ⬆️
valentine/metrics/metric_helpers.py                    100.00% <100.00%> (+5.40%) ⬆️
...ne/algorithms/jaccard_distance/jaccard_distance.py  98.44% <97.77%> (-1.56%) ⬇️
valentine/algorithms/matcher_results.py                98.08% <97.29%> (-0.85%) ⬇️
valentine/metrics/metrics.py                           88.54% <85.71%> (ø)
...lgorithms/distribution_based/distribution_based.py  95.68% <80.00%> (-3.34%) ⬇️

... and 3 files with indirect coverage changes

@kPsarakis kPsarakis assigned chrisk21 and unassigned chrisk21 May 5, 2026
@kPsarakis kPsarakis merged commit b822900 into master May 5, 2026
40 of 41 checks passed
@kPsarakis kPsarakis deleted the add_embeddings branch May 5, 2026 12:26

Development

Successfully merging this pull request may close these issues.

Add embedding-based methods

2 participants