MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble

datasketch: Big Data Looks Small

datasketch gives you probabilistic data structures that can process and search very large amounts of data quickly, with little loss of accuracy.

This package contains the following data sketches:

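=================  ===========================================
Data Sketch        Usage
=================  ===========================================
MinHash            estimate Jaccard similarity and cardinality
Weighted MinHash   estimate weighted Jaccard similarity
HyperLogLog        estimate cardinality
HyperLogLog++      estimate cardinality
=================  ===========================================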
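
To give a feel for how a MinHash sketch can estimate Jaccard similarity, here is a minimal illustration that uses only the Python standard library. It is a sketch of the general technique, not datasketch's implementation: the helper names ``minhash_signature`` and ``estimate_jaccard`` are hypothetical, and salted BLAKE2b hashes are assumed as a stand-in for random permutations.

```python
import hashlib


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: one minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("items must be non-empty")
    signature = []
    for seed in range(num_perm):
        # A different salt per seed stands in for a different random permutation.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Two sets of 100 tokens each that share 50 tokens.
set_a = {"w%d" % i for i in range(100)}
set_b = {"w%d" % i for i in range(50, 150)}

exact = len(set_a & set_b) / len(set_a | set_b)
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print("exact: %.3f, estimate: %.3f" % (exact, estimate))
```

The signature has a fixed length regardless of how large the input set is, which is what makes the sketch compact. Using more hash functions (a larger ``num_perm``) generally tightens the estimate at the cost of more computation and memory.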

The following indexes for data sketches are provided to support sub-linear query time:

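=====================  =========================  =====================
Index                  For Data Sketch            Supported Query Type
=====================  =========================  =====================
MinHash LSH            MinHash, Weighted MinHash  Jaccard Threshold
MinHash LSH Forest     MinHash, Weighted MinHash  Jaccard Top-K
MinHash LSH Ensemble   MinHash                    Containment Threshold
=====================  =========================  =====================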
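
The idea behind sub-linear threshold queries can be illustrated with the classic banding scheme for locality-sensitive hashing: split each signature into bands, bucket items by band, and treat anything that shares a bucket with the query as a candidate. The following is a minimal standard-library sketch under that assumption. The ``BandedIndex`` class and ``minhash_signature`` helper are hypothetical names for illustration and do not reflect how datasketch's indexes are implemented or tuned.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_perm=32):
    """One minimum salted-hash value per seed (a stand-in for random permutations)."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest(),
                "big",
            )
            for item in items
        )
        for seed in range(num_perm)
    ]


class BandedIndex:
    """Toy LSH index: signatures are cut into `bands` bands of `rows` values each."""

    def __init__(self, bands=8, rows=4):
        self.bands = bands
        self.rows = rows
        # One hash table per band, mapping a band's values to the keys stored there.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        for i in range(self.bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def insert(self, key, signature):
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return keys that share at least one band with the query signature."""
        candidates = set()
        for i, band in self._band_keys(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


docs = {
    "a": {"minhash", "lsh", "forest", "ensemble"},
    "b": {"redis", "mongodb", "storage", "layer"},
}
index = BandedIndex(bands=8, rows=4)
for key, tokens in docs.items():
    index.insert(key, minhash_signature(tokens, num_perm=32))

# Only the buckets matching the query's bands are inspected, not every stored item.
result = index.query(minhash_signature(docs["a"], num_perm=32))
print(result)
```

In this scheme the number of bands and rows controls the trade-off: more rows per band makes candidates harder to match (fewer false positives), while more bands makes matching easier (fewer false negatives).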

datasketch requires Python 2.7 or above and NumPy 1.11 or above. SciPy is optional, but with it LSH initialization can be much faster.

Note that MinHash LSH also supports a Redis storage layer, as well as an experimental module with an asynchronous interface to MongoDB.

Install

To install datasketch using pip:

pip install datasketch -U

This will also install NumPy as a dependency.
