Skip to content
master
Switch branches/tags
Code

Latest commit

The MinHash implementation used np.float for internal computations.
With Numpy 1.20 is np.float alias deprecated and generates the following
warning:

```
minhash.py:179: DeprecationWarning: `np.float` is a deprecated alias for
the builtin `float`. To silence this warning, use `float` by itself.
Doing this will not modify any behavior and is safe. If you specifically
wanted the numpy scalar type, use `np.float64` here.
```
5a15b17

Git stats

Files

Permalink
Failed to load latest commit information.

datasketch: Big Data Looks Small

datasketch gives you probabilistic data structures that can process and search very large amount of data super fast, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch Usage
MinHash estimate Jaccard similarity and cardinality
Weighted MinHash estimate weighted Jaccard similarity
HyperLogLog estimate cardinality
HyperLogLog++ estimate cardinality

The following indexes for data sketches are provided to support sub-linear query time:

Index For Data Sketch Supported Query Type
MinHash LSH MinHash, Weighted MinHash Jaccard Threshold
MinHash LSH Forest MinHash, Weighted MinHash Jaccard Top-K
MinHash LSH Ensemble MinHash Containment Threshold

datasketch must be used with Python 2.7 or above and NumPy 1.11 or above. Scipy is optional, but with it the LSH initialization can be much faster.

Note that MinHash LSH and MinHash LSH Ensemble also support Redis and Cassandra storage layer (see MinHash LSH at Scale).

Install

To install datasketch using pip:

pip install datasketch

This will also install NumPy as dependency.

To install with Redis dependency:

pip install datasketch[redis]

To install with Cassandra dependency:

pip install datasketch[cassandra]

To install with Scipy for faster MinHashLSH initialization:

pip install datasketch[scipy]