# simmetry

Similarity scores for strings, vectors, points, and sets with a small, NumPy-first API.
## Install

```bash
pip install simmetry
```

Optional extras:

```bash
pip install "simmetry[fast]"
```

`simmetry[fast]` enables optional Numba acceleration for `pairwise(..., metric="euclidean_sim")` and `pairwise(..., metric="manhattan_sim")`.

ANN extras:

```bash
pip install "simmetry[ann-hnsw]"
pip install "simmetry[ann-faiss]"
```
## Status

- Current package: `simmetry` on PyPI
- Current version in this repo: 1.0.3
- Maturity: Alpha (API may change; pin exact or minor versions in production)
- Versioning: targets semantic versioning, but pre-hardening changes may still occur in minor releases until 1.x stabilizes
## Quick start

```python
from simmetry import similarity

similarity("kitten", "sitting", metric="levenshtein")
similarity([1, 2, 3], [1, 2, 4], metric="cosine")
similarity((41.1, 29.0), (41.2, 29.1), metric="haversine_km")
similarity({1, 2, 3}, {2, 3, 4}, metric="jaccard")
```

Note: `haversine_km` returns geographic distance in kilometers.
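For intuition about that kilometer scale, the great-circle distance can be computed directly with NumPy. This is the standard textbook haversine formula, not simmetry's implementation; the helper name below is only for illustration.

```python
import numpy as np

def haversine_distance_km(p1, p2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([p1[0], p1[1], p2[0], p2[1]])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

haversine_distance_km((41.1, 29.0), (41.2, 29.1))  # roughly 14 km
```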
Pairwise similarity over a matrix of row vectors:

```python
import numpy as np
from simmetry import pairwise

X = np.random.randn(1000, 128)
S = pairwise(X, metric="cosine")
```

Top-k nearest rows for a query vector:

```python
import numpy as np
from simmetry import topk

X = np.random.randn(5000, 64)
q = np.random.randn(64)
idx, scores = topk(q, X, k=10, metric="cosine")
```

List available metrics, optionally filtered by kind:

```python
from simmetry import available

available()
available("vector")
available("string")
available("point")
available("set")
```

## Metrics

- Vector: `cosine`, `dot`, `euclidean_sim`, `manhattan_sim`, `pearson`
- String: `levenshtein` (normalized similarity), `jaro_winkler`, `ngram_jaccard` (character n-gram set Jaccard), `token_jaccard` (whitespace token set Jaccard)
- Point: `euclidean_2d`, `haversine_km`, plus the `pairwise_points` and `topk_points` helpers
- Set: `jaccard`, `dice`, `overlap`
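For reference, the three set metrics follow the standard textbook definitions. The sketch below is plain Python to show how the scores relate on the same input pair; it is not the library's source.

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b) if a or b else 1.0

def dice(a, b):
    # 2 * |A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

def overlap(a, b):
    # |A ∩ B| / min(|A|, |B|)
    return len(a & b) / min(len(a), len(b)) if a and b else 1.0

jaccard({1, 2, 3}, {2, 3, 4})  # 0.5
dice({1, 2, 3}, {2, 3, 4})     # ~0.667
overlap({1, 2, 3}, {2, 3, 4})  # ~0.667
```

Overlap is always at least as large as Dice, which is at least as large as Jaccard, so thresholds are not interchangeable across the three.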
## Metric auto-selection

Auto mode is not random and not learned. It applies fixed type-based rules.

```python
from simmetry import infer_metric, similarity

infer_metric("samplecorp", "sample corp")  # "jaro_winkler"
infer_metric((41.0, 29.0), (41.1, 29.1))   # "haversine_km"
infer_metric({1, 2, 3}, {2, 3, 4})         # "jaccard"
similarity("samplecorp", "sample corp")    # uses the inferred metric
```

Selection order:

- `list[str]` / `tuple[str]` (including empty lists) -> batch strings (`jaro_winkler`)
- `str` + `str` -> `jaro_winkler`
- 2-number tuples/lists -> `haversine_km`
- `set` / `frozenset` -> `jaccard`
- numeric vectors -> `cosine`
- fallback -> `cosine`
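The selection order above can be sketched as a chain of fixed `isinstance` checks where the first match wins. This is an illustrative re-implementation under the stated rules, not the actual `infer_metric` source; the function name `infer` is hypothetical.

```python
from numbers import Number

def infer(a, b=None):
    # Rules are checked in the documented order; the first match wins.
    if isinstance(a, (list, tuple)) and all(isinstance(x, str) for x in a):
        return "jaro_winkler"   # batch strings; all() is True for an empty list
    if isinstance(a, str) and isinstance(b, str):
        return "jaro_winkler"
    if isinstance(a, (list, tuple)) and len(a) == 2 and all(isinstance(x, Number) for x in a):
        return "haversine_km"   # 2-number tuple/list is treated as a geo point
    if isinstance(a, (set, frozenset)):
        return "jaccard"
    if isinstance(a, (list, tuple)) and all(isinstance(x, Number) for x in a):
        return "cosine"         # numeric vector
    return "cosine"             # fallback
```

Note the consequence of the ordering: a 2-element numeric pair is routed to `haversine_km`, so plain 2-d vectors need an explicit `metric=` to get cosine.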
Batch helpers for strings:

```python
from simmetry.strings import pairwise_strings, topk_strings

S = pairwise_strings(
    ["item_one", "item_two"],
    ["item_one", "item_alt"],
    metric="jaro_winkler",
)
idx, scores = topk_strings(
    "samplecorp",
    ["samplecorp", "examplefinance", "testgroup"],
    k=2,
    metric="levenshtein",
)
```

Batch helpers for points:

```python
from simmetry.points import pairwise_points, topk_points

pts = [(41.0, 29.0), (41.01, 29.01), (40.9, 28.9)]
S = pairwise_points(pts, metric="haversine_km")
idx, scores = topk_points((41.0, 29.0), pts, k=2, metric="haversine_km")
```

For very large vector corpora (100k+ rows), exact `topk()` can be slow.
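To see why exact search slows down, here is a minimal brute-force cosine top-k in plain NumPy (an illustrative sketch under the assumption of dense float rows, not simmetry's `topk`): every query scans all n rows, so the cost is O(n·d) per query, with no index to prune candidates.

```python
import numpy as np

def exact_topk(q, X, k):
    # Normalize rows and the query so a dot product equals cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    qn = q / np.linalg.norm(q)
    scores = Xn @ qn                       # one pass over all n rows: O(n * d)
    idx = np.argpartition(-scores, k)[:k]  # unordered top-k candidates in O(n)
    idx = idx[np.argsort(-scores[idx])]    # sort only the k winners
    return idx, scores[idx]
```

ANN backends trade a little recall for sublinear query time on exactly this workload.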
Approximate neighbors via HNSW:

```python
import numpy as np
from simmetry.ann import build_hnsw

X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)
index = build_hnsw(X, space="cosine")
labels, distances = index.query(X[0], k=10)
```

Approximate neighbors via FAISS (inner product on normalized rows equals cosine):

```python
import numpy as np
from simmetry.ann import build_faiss

X = np.random.randn(200_000, 128).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)
index = build_faiss(X, metric="ip")
labels, scores = index.query(X[0], k=10)
```

A unified index wrapper with a selectable backend:

```python
import numpy as np
from simmetry import SimIndex

X = np.random.randn(50_000, 128).astype("float32")
index = SimIndex(metric="cosine", backend="exact").add(X)
idx, scores = index.query(X[0], k=10)
```

Weighted, per-field similarity for record dicts:

```python
from simmetry import similarity

a = {"name": "Entity One", "city": "CityAlpha", "loc": (41.0, 29.0)}
b = {"name": "Entity One Extended", "city": "CityAlpha", "loc": (41.01, 28.99)}
score = similarity(
    a,
    b,
    metric={"name": "jaro_winkler", "loc": "haversine_km"},
    weights={"name": 0.7, "loc": 0.3},
)
```

## Benchmarks

The project includes a benchmark harness in `bench/run.py`. Comparative benchmarks against rapidfuzz, scikit-learn, and ANN libraries are not published yet.
Run locally:

```bash
python bench/run.py
```

## Roadmap

Current focus is a compact core with predictable APIs and optional ANN.
Planned additions (not implemented yet):
- String metrics: Hamming, BM25-style text ranking helpers, string-level Sorensen-Dice variants
- Published comparative benchmarks (RapidFuzz / sklearn / faiss baselines)
- Hosted docs site
## License

MIT