A small toolkit for very efficiently comparing the similarity of large numbers of documents or other data structures that can be represented as sets. Core features:
- Very fast construction of MinHash "fingerprints" of sets. The algorithm is inspired by the MinHash implementation in datasketch, but the core MinHash algorithm is vectorized in numpy and includes CUDA support via cupy.
- Jaccard similarity index estimation
- Cardinality estimation (with bias correction for much better accuracy)
- Union operations between fingerprints
The module also includes some simple functions for quickly converting text to token or n-gram-based hashes.
- cupy and CUDA for GPU acceleration (optional)
I haven't made it pip installable yet, so you'll just have to grab the repository files for now.
How to use
The module is built around a
VectorizedMinHash hasher object. To construct fingerprint from a string using CUDA support:
from vectorizedMinHash import VectorizedMinHash,fastNGramHashes hasher = VectorizedMinHash(n_perm=256,mirror=True) testString = 'This is a test string' # 1) Construct a fingerprint from n-gram features # (fastNGramHashes converts bytes directly to n-gram ids by changing the stride of the dtype) ngramHashes = fastNGramHashes(testString.encode('ascii'),n=4) fingerprint = hasher.fingerprint(ngramHashes,cuda=True) # 2) Construct a fingerprint from tokens (split on whitespace by default) tokenHashes = tokenHashes(testString.encode('ascii')) fingerprint = hasher.fingerprint(ngramHashes,cuda=True) # 3) Construct a fingerprint from tokens (split on non-alphanumeric chars) tokenHashes = tokenHashes(testString.encode('ascii'),tokenRE=rb'\w+') fingerprint = hasher.fingerprint(tokenHashes,cuda=True)
You can also compute a fingerprint from a more generic array of integer ids (must be representable as a
# Construct a fingerprint from element ids ids = range(10) fingerprint = hasher.fingerprint(ids,cuda=True)
The constructor's most important parameter is
n_perm, which sets the size of the fingerprints. Larger fingerprints are more accurate, but require more processing time and memory to store.
mirror doubles the length of the fingerprint for a given
n_perm by using the max operation as in addition to min. In my experiments this saves processing time and improves accuracy, so it is the default setting.
A fingerprint is a simple
np.uint32 array. It doesn't remember what hasher made it, so be careful to only compare fingerprints made with exactly the same settings.
Merging fingerprints with union
Fingerprints can be merged with the
union function. This operation is equivalent to taking the union (or concatenating all the hash values) of the input sets before fingerprinting. Behind the scenes, it just stacks the fingerprints into an 2-d array and takes the min.
from vectorizedMinHash import union new_fingerprint = union([fingerprint0,fingerprint1,fingerprint2])
Fingerprints constructed via
union are indistinguishable from other fingerprints and it is perfectly valid to use them in jaccard or cardinality estimates.
Jaccard similarity index
The Jaccard similarity index between two sets can be estimated by the fraction of equal values in the corresponding fingerprints. The module contains two helper functions:
from vectorizedMinHash import jaccard, jaccardMatrix # Compute jaccard similarity index for two fingerprints: jac = jaccard(fingerprint0,fingerprint1) # Compute all pairwise similarities for three fingerprints jac_matrix = jaccardMatrix([fingerprint0,fingerprint1,fingerprint2])
The cardinality of a fingerprint (that is, the number of distinct hashes used to generate it) can be estimated with the
n = hasher.cardinality(fingerprint)
A Note on bias correction
Accurately estimating the cardinality of a set from its fingerprint is a little tricky. The method used in Datasketch MinHash has a huge downward bias, and more accurate methods usually involve completely different hashing algorithms (which can't simultaneously compute Jaccard similarity). The implementation in this module uses a simple bias-corrected maximum likelihood estimator to significantly increase the accuracy. The only complication is that the bias correction coefficients must be empirically estimated. The module has the required code at the end of the
__init__.py file. Pre-computed bias correction coefficients are stored in
bias_coef.npy, and should load automatically from the project root directory.